Author manuscript; available in PMC: 2026 Apr 14.
Published in final edited form as: J Acoust Soc Am. 2026 Feb 1;159(2):1201–1209. doi: 10.1121/10.0042379

Cross-frequency interactions in band importance functions

Adam K Bosen 1,a), Anastasia J Rogers 1, Ryan W McCreery 1, Emily Buss 2
PMCID: PMC13072708  NIHMSID: NIHMS2163682  PMID: 41661024

Abstract

The Speech Intelligibility Index (SII) is a metric of the amount of information available in a degraded or masked speech signal. The SII is used to predict speech recognition outcomes and is part of hearing aid prescription formulae. A critical assumption in the calculation of the SII is that frequency bands contribute independently to speech recognition, i.e., the importance of a band does not change based on the context of speech cues in other bands. Prior work has challenged this assumption by demonstrating that pairs of bands can contain synergistic or redundant information. The present work extends these findings by directly measuring pairwise interactions between the 21 frequency bands defined by the Critical Band Procedure of the SII. Forty-one participants with normal hearing identified words filtered to contain pseudorandom combinations of four or five bands. Pairwise interactions indicated both synergy and redundancy and accounted for substantial variability in recognition accuracy. The importance of individual bands decreased when pairwise interactions were considered, with the largest decreases for frequency bands above 1 kHz. The spectral proximity and envelope correlation between pairs of bands predicted whether their combination was synergistic or redundant. Interactions between bands play a critical role in speech recognition.

I. INTRODUCTION

The Speech Intelligibility Index (SII) (ANSI, 1997) is a widely used metric of speech audibility that plays a prominent role in audiological prognosis and intervention. The SII is calculated by dividing the speech spectrum into discrete bands, multiplying the audibility of the acoustic signal in each band by the importance of that band, and summing the result. An assumption inherent to this calculation is that importance is independent across bands, such that the contribution of any one band to the index is unaffected by the audibility of any other band. Several studies have demonstrated that this assumption is false (Humes and Kidd, 2016; Kryter, 1960; Warren et al., 2005). Presenting spectrally separated frequency bands together can produce superadditive or subadditive increases in speech recognition accuracy relative to the presentation of those frequency bands in isolation. The goal of the present work is to extend these prior results by directly measuring the importance of pairwise interactions between frequency bands.
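The weighted-sum calculation described above can be sketched in a few lines of code. This is a minimal Python illustration; the audibility and importance values are made up and are not the ANSI band importance function.

```python
def sii(audibility, importance):
    """Weighted sum of per-band audibility (0-1) and band importance weights."""
    assert len(audibility) == len(importance)
    return sum(a * w for a, w in zip(audibility, importance))

# Hypothetical five-band importance function (weights sum to 1)
importance = [0.1, 0.2, 0.4, 0.2, 0.1]

full = sii([1.0, 1.0, 1.0, 1.0, 1.0], importance)   # fully audible speech
half = sii([1.0, 1.0, 0.0, 0.0, 0.0], importance)   # upper bands masked
```

Because the sum treats each band's term independently, masking a band always removes exactly that band's contribution, which is the independence assumption the present study tests.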

The established standard for estimating band importance is to filter speech with a range of low-pass and high-pass cutoff frequencies and measure the effect of removing portions of the spectrum from the acoustic signal on speech recognition accuracy (ANSI, 1997, 1969). If a frequency band is removed and speech recognition substantially decreases, that band is inferred to have high importance. Conversely, if a band is removed and speech recognition accuracy is unaffected, that band is inferred to have low importance. The problem with this approach is that low-pass and high-pass filtering conflates the effect of a band with the fixed context of all lower or higher frequency bands that are also present, so that removing a band both eliminates access to cues within that band and eliminates access to cues conveyed by integrating acoustic information across that band and the other bands present. An improvement to this technique was proposed by Kasturi et al. (2002), who filtered consonant and vowel stimuli to remove all combinations of one or two octave-wide frequency bands. This approach averages contextual interactions across frequency bands out of estimates of importance. Apoux and Healy (2012) further refined this approach by filtering stimuli into a larger set of bands, each one equivalent rectangular bandwidth (Glasberg and Moore, 1990) wide, and removing the majority of those bands from the stimuli. Removing a larger proportion of the spectrum reduced accuracy to the point where it was sensitive to small changes in spectral content, and dividing stimuli into a greater number of narrower bands revealed a detailed structure in importance across the spectrum. This approach was also used to estimate the average importance of each band across its interactions with cues in all other bands for a variety of stimuli (Healy et al., 2013).
Subsequent work elaborated on this approach by demonstrating that it was feasible to estimate individual differences in band importance across listeners with different hearing status and ages (Bosen et al., 2025; Bosen and Chatterjee, 2016; Shen and Kern, 2018; Yoho and Bosen, 2019).

Prior experiments have found that frequency bands may convey more, the same, or less information in conjunction than the net contribution that each band provides in isolation. When multiple bands are presented together, speech recognition accuracy can be greater or less than the sum of recognition accuracy when either band is presented in isolation (Kryter, 1960; Müsch and Buus, 2001; Warren et al., 2005). Superadditive combinations indicate that those bands contain synergistic information, whereas subadditive combinations indicate that those bands contain redundant information. Similarly, many different combinations of frequency bands can have approximately equal summed importance and thus are predicted to yield approximately equal recognition accuracy. However, in practice, combinations of frequency bands with equal importance can yield unequal speech recognition accuracy, with the greatest accuracy for combinations that space bands evenly across the spectrum and the lowest accuracy for combinations that cluster bands in one portion of the spectrum (Humes and Kidd, 2016). Here, we directly compare pairwise cross-frequency interactions to the average effect of each band, with the expectation that cross-frequency interactions will substantially contribute to speech recognition accuracy.

If such cross-frequency interactions are large relative to the average importance of each band, then it is possible that the importance of frequency bands will change when cross-frequency band interactions are accounted for. Some speech cues are contained within a band (e.g., spectrotemporal contours within a frequency band), whereas others are contextually dependent on information in other bands (e.g., the location of the first formant relative to the second) (Strange, 1989). We will examine whether the importance of bands changes depending on whether cross-frequency band interactions are considered.

Speech is organized into covarying acoustic fluctuations that naturally cluster into distinct frequency bands that do not match the band limits defined by the SII (Bosen et al., 2024; Ueda and Nakajima, 2017). Pairs of frequency bands that are proximal in the spectrum and/or covary in acoustic level are more likely to contain redundant information than bands that are distal from one another (Müsch and Buus, 2001; Warren et al., 2005) or do not covary. If bands contain redundant information, then their combination should have less importance than the summed importance of those bands, which manifests as negative importance for band combinations. Thus, we predict that the importance of cross-frequency interactions for speech recognition accuracy will be positively correlated with spectral distance and negatively correlated with cross-frequency envelope correlation.

Recent work has demonstrated that band importance can be estimated with a few hundred speech recognition trials (Shen and Kern, 2018). In contrast, joint estimation of both the average importance of each band and all pairwise cross-frequency combinations introduces many more parameters that need to be estimated, so a larger quantity of data is needed to precisely estimate these additional parameters. In the present study, we opted to exhaustively test all combinations of frequency bands. However, this is an arduous amount of data collection for future studies, so we also examined how the reliability of cross-frequency interaction estimates varies with the amount of data collected. We anticipate that our full data set will be highly reliable, but that fewer data are needed to estimate cross-frequency interactions with adequate reliability.

II. METHOD

A. Participants

Forty-one young adults (range of 19–29 years, mean of 23.2 years) who had normal hearing, were native speakers of American English, and had no known cognitive deficits that would interfere with task completion participated in this study. All participants were screened for normal hearing, defined as air-conduction thresholds of 20 dB HL or better bilaterally across frequencies from 0.25 to 8 kHz. Among those participants, six had a threshold of 25 dB HL in one ear at a single frequency, with the affected frequency varying across individuals. Of these participants, 30 identified as female (73.2%), 10 identified as male (24.4%), and 1 declined to report (2.4%). Participants were recruited by contacting individuals who registered for Boys Town’s Research Administration Database or responded to advertisements on Boys Town’s website, and via word of mouth. All data were collected at Boys Town National Research Hospital. Participants were compensated at $20/h. The study was approved by the Boys Town National Research Hospital Institutional Review Board.

B. Stimuli

Participants heard word stimuli that were filtered to remove all but a subset of frequency bands and repeated aloud what they heard. Target words were 500 monosyllabic consonant-vowel-consonant words taken from the Minimum Speech Test Battery (MSTB, 2011). Those stimuli were produced by a male talker and recorded at a rate of 44.1 kHz at 16-bit depth. Stimulus presentation level was 65 dB SPL prior to filtering. To facilitate comparison to prior studies, we filtered these words into the 21 frequency bands defined by the ANSI SII Critical Band procedure (ANSI, 1997). Each filter was a bandpass 2000-order Hamming window finite impulse response filter, and stimuli were passed through filters in forward and reverse order to avoid any phase distortion.
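The zero-phase bandpass filtering described above can be approximated as follows. This is a Python/SciPy sketch under stated assumptions: the band edges shown are illustrative rather than the ANSI critical band limits, and `filtfilt` implements the forward-and-reverse application that cancels phase distortion.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

def bandpass_zero_phase(x, f_lo, f_hi, fs=44100, order=2000):
    """Bandpass with a 2000-order Hamming-window FIR filter, applied
    forward and backward (filtfilt) so the net phase response is zero."""
    taps = firwin(order + 1, [f_lo, f_hi], pass_zero=False,
                  window="hamming", fs=fs)
    return filtfilt(taps, [1.0], x)

fs = 44100
t = np.arange(fs) / fs
# 100 Hz tone (outside the band) plus 1 kHz tone (inside the band)
x = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1000 * t)
y = bandpass_zero_phase(x, 800, 1250, fs=fs)   # band edges are illustrative
```

With a 2000-order filter at 44.1 kHz the transition bands are narrow (roughly 70 Hz), so the out-of-band tone is strongly attenuated while the in-band tone passes essentially unchanged.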

To ensure that our experiment was sensitive to the presence or absence of each frequency band, average task accuracy needed to avoid floor or ceiling, ideally around 50%. Additionally, the number of bands selected for each trial needed to have at least two values to ensure the experiment design was not rank-deficient. Prior work indicates that selecting 4 or 5 of 21 bands should yield around 30%–45% word recognition accuracy on average (Healy et al., 2013), a solution which addresses both design constraints. For trials that include 4 out of 21 bands, there are 5985 unique band combinations, and for trials that include 5 out of 21 bands there are 20 349 unique combinations. Of these unique combinations, it was not evident how many of them—or which ones—would be required to reliably estimate cross-frequency interactions, so we decided to exhaustively test all combinations.
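The combination counts above follow directly from the binomial coefficient; a quick check in Python:

```python
from math import comb

N_BANDS = 21
four_band = comb(N_BANDS, 4)   # number of unique 4-band combinations
five_band = comb(N_BANDS, 5)   # number of unique 5-band combinations
```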

This large number of combinations would be challenging to test within participants, so we decided to split testing across multiple participants and combine results. To maximize the amount of data collected per participant without inducing excessive fatigue or requiring multiple experimental sessions, we decided that the maximum number of word recognition trials per participant would be 1000. Dividing these trials into an equal number of 4 band and 5 band trials yielded 500 5-band trials per participant, so exhaustive testing of all combinations required 41 participants to complete 41 000 trials.1 We generated pseudorandomized sequences of band combinations with the goal of balancing the number of times each participant heard each band and each pairwise combination of bands. Each word in the MSTB was presented to each participant twice, once with 4 bands and once with 5 bands. The assignment of words to band combinations was randomized across participants.

C. Procedure

Participants provided informed consent, completed a hearing screening, and then completed the speech recognition task. They sat in a double-walled sound attenuated booth at a desk that supported a computer monitor, mouse, keyboard, loudspeaker (Genelec 8330APM), and microphone (Audio-Technica AT2020). Participants were seated approximately 1 m from the loudspeaker. Stimuli were presented and verbal responses were recorded through a custom MATLAB (MathWorks, Natick, MA) interface controlling an external audio interface (MOTU M4). Participants were familiarized with the procedure and instructed to listen and repeat monosyllabic words that were filtered to be hard to understand. The 1000 trials per participant were divided into 10 blocks of 100 words each. Participants were encouraged to take breaks as needed between blocks. Each participant completed their trials in one experimental session that lasted between about 60 and 90 min.

Stimulus playback was initiated by participants clicking a button on the computer, so the task was self-paced. The carrier word “Ready” was presented 500 ms prior to the onset of the target word. This carrier word was used as an attentional cue and was filtered to contain the same combination of frequency bands as each target word. The stimulus was played once, and the participant repeated what they heard, excluding the carrier word. No feedback was provided. An experimenter with normal hearing sat outside of the sound booth and listened to the responses through headphones. Responses were recorded and scored by that experimenter in real time, with responses that matched the target word marked as correct, and all other responses marked as incorrect. Incorrect responses were phonetically transcribed by the experimenter, either during the experiment or when listening to audio recordings for responses that were not transcribed in real time. For one trial, a response was not transcribed in real time and a software error lost the voice recording, so that trial was omitted from the data analysis. To validate response transcriptions, a second lab member with normal hearing transcribed one randomly selected block for each of the first 20 participants (2000 words total). Those results were compared with the original transcriptions. Cohen’s kappa between transcriptions was κ = 0.955, indicating good agreement across transcribers. Edit distance was used to calculate the number of edits required to transform the response to the target word (Gonthier, 2022). Recognition accuracy was scored as three phonemes minus the number of edits, with a minimum of zero.
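The edit-distance scoring rule can be sketched as follows. This is a Python illustration; the phoneme strings are hypothetical (the actual transcriptions used phonetic symbols rather than orthography), and the edit-distance implementation is a generic Levenshtein distance rather than the cited package.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def phoneme_score(target, response):
    """Score = three phonemes minus the number of edits, floored at zero."""
    return max(0, 3 - edit_distance(target, response))

# Hypothetical phoneme strings: target /kat/, response /kap/
score = phoneme_score("kat", "kap")   # one substitution -> 2 phonemes correct
```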

D. Data analysis

To estimate the importance of each frequency band and pairwise interactions between bands, a generalized linear mixed effects model with a beta-binomial family was used to predict speech recognition accuracy based on which bands were present in the stimulus. Model fitting was performed in R (R Core Team, 2024) using the brms package [version 2.22.0 (Bürkner, 2017)] and Stan (Carpenter et al., 2017) via the RStan interface (version 2.32.7). Response accuracy ranged between 0 and 3 phonemes correct per word and was regressed against the presence or absence (coded as 0.5 or −0.5) of each frequency band in the target word. The beta-binomial family uses a logistic link function to relate a discrete outcome (accuracy) to a set of predictor variables (the presence of each band). This approach quantifies band importance for speech recognition as the log-odds of accurately identifying phonemes in words when that band is present. In addition, the beta-binomial family contains an additional free parameter that quantifies the intraclass correlation of response accuracy within words, which we interpret as a measure of context effects in real words (Bosen, 2024). Two models were compared. The first model only included the main effect of each band on recognition. We have previously used similar models to estimate the main effect of each band in logistic regression to measure band importance in a variety of participants (Bosen et al., 2025; Bosen and Chatterjee, 2016; Yoho and Bosen, 2019). The second model included both main effects and pairwise interactions between bands. Mean posterior values for all model coefficients were calculated and compared across models. Model goodness of fit was compared using Pareto smoothed importance-sampling leave-one-out cross-validation (PSIS-LOO) (Vehtari et al., 2017). This approach calculates expected log pointwise predictive density (ELPD).
Differences in ELPD between models that are greater than four indicate that one model is a substantially better explanation for the data (Sivula et al., 2020). The r package that calculates ELPD also estimates the effective number of parameters needed to explain the data based on how well the model can predict the distribution of data. The effective number of parameters is expected to be less than or equal to the true number of parameters if the model is well-specified to describe the observed distribution of data.
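The predictor coding described above (band presence coded as ±0.5, plus all pairwise products as interaction terms) can be sketched as follows. The actual model was a beta-binomial GLMM fit with brms in R; this Python fragment only illustrates how one trial's row of the design matrix would be built, not the fitting itself.

```python
from itertools import combinations
import numpy as np

N_BANDS = 21

def design_row(present):
    """Code each band as +0.5 (present) or -0.5 (absent), then append the
    product of codes for every pair of bands as the interaction terms."""
    main = np.array([0.5 if b in present else -0.5 for b in range(N_BANDS)])
    inter = np.array([main[i] * main[j]
                      for i, j in combinations(range(N_BANDS), 2)])
    return np.concatenate([main, inter])

row = design_row({0, 4, 9, 15})          # a hypothetical 4-band trial
n_inter = N_BANDS * (N_BANDS - 1) // 2   # 210 pairwise interaction terms
```

With this centered ±0.5 coding, an interaction term is +0.25 when two bands are both present (or both absent) and −0.25 otherwise, so a positive coefficient corresponds to synergy between the pair.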

In the pre-registration of this study (Rogers et al., 2025), we proposed to include random intercepts per participant and per target word in these models. Random intercepts per participant were included in the results presented below, but we realized post hoc that the assignment of random band combinations to target words and the relatively small number of times each word was presented (82, twice for each participant) would make estimating 500 additional random intercepts per word infeasible and challenging to interpret. Given that prior studies of band importance, to our knowledge, have not attempted to quantify variability in recognition accuracy per stimulus and that these random effects were not a primary focus of our study, we opted to omit them.

To examine whether envelope correlations across frequency bands predicted the importance of cross-frequency interactions, we concatenated the audio waveforms for all words, filtered the concatenated waveform into bands, then calculated the pairwise correlation of envelope level between bands. Envelope level was calculated via the Hilbert transform and converted to a log scale. This analysis was conducted in MATLAB.
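The envelope-correlation analysis can be sketched as follows. This is a Python/SciPy illustration on synthetic amplitude-modulated tones rather than the concatenated word recordings (the authors used MATLAB); the carrier and modulation frequencies are made up.

```python
import numpy as np
from scipy.signal import hilbert

def log_envelope(x, floor=1e-6):
    """Envelope level via the analytic signal (Hilbert transform), in dB."""
    env = np.abs(hilbert(x))
    return 20.0 * np.log10(np.maximum(env, floor))

fs = 8000
t = np.arange(2 * fs) / fs
am = 1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)    # shared 4 Hz modulator
band1 = am * np.sin(2 * np.pi * 500 * t)      # hypothetical low band
band2 = am * np.sin(2 * np.pi * 1500 * t)     # hypothetical high band

# Trim the edges to avoid Hilbert boundary artifacts, then correlate levels
e1 = log_envelope(band1)[800:-800]
e2 = log_envelope(band2)[800:-800]
r = float(np.corrcoef(e1, e2)[0, 1])
```

Because both bands share the same modulator, their log-envelope correlation is near 1; bands of real speech carried partially shared modulations, yielding the intermediate correlations in Fig. 3.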

To estimate the reliability of model coefficient estimates, we used Monte-Carlo resampling with replacement (Pronk et al., 2022). The data were randomly resampled twice, and the model with pairwise interaction terms was fit to each sampled dataset. The importance of each band and pairwise combination of bands was calculated for each dataset, and the correlation of importance values was computed to provide an estimate of reliability. This process was completed for sample sizes of 41 000 (the size of data collected here), and at values of 20 000, 10 000, 5000, and 2500, which are roughly fractional values of the total amount of data. The resampling algorithm was run ten times per sample size to obtain multiple estimates of reliability while keeping the time required to fit the model to resampled datasets reasonable. This procedure took about 50 h to complete on a modern PC.
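The resampling logic can be illustrated with a toy simulation. In this Python sketch, per-band mean accuracy stands in for refitting the full interaction model, and the trial data are synthetic; only the resample-twice-and-correlate structure matches the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic trial data: each trial has a band index and a binary outcome
# whose probability depends on the band (a stand-in for the real data).
true_p = np.linspace(0.2, 0.8, 21)
bands = rng.integers(0, 21, size=41000)
correct = rng.random(41000) < true_p[bands]

def resampled_estimate(n):
    """Per-band mean accuracy from one resample (with replacement) of n trials."""
    idx = rng.integers(0, len(bands), size=n)
    b, c = bands[idx], correct[idx]
    return np.array([c[b == k].mean() if (b == k).any() else 0.0
                     for k in range(21)])

def reliability(n):
    """Correlation between two independent resampled estimates."""
    return float(np.corrcoef(resampled_estimate(n), resampled_estimate(n))[0, 1])

r_large = reliability(41000)   # analogous to the full dataset
r_small = reliability(500)     # far fewer trials -> noisier estimates
```

Even in this simplified setting, reliability falls as the resample size shrinks, mirroring the trend reported in Fig. 4.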

III. RESULTS

The left panel of Fig. 1 shows band importance for the model that included only main effects. Band importance follows expected trends, with peak importance between 1 and 2 kHz and the lowest importance at the upper and lower ends of the spectrum. All bands had positive importance, indicating that the presence of each band contributed to speech recognition to some extent. For comparison, band weights from the ANSI SII for NU6 words were scaled to match the range of importance in the current study. Trends are broadly consistent between the standard and our results, although the shape of the function differs to some extent.

FIG. 1.

Band importance for MSTB CNC words when only the main effect of each band is included in the model (left) and when pairwise interaction terms are also included (right). For qualitative comparison of trends between our results and established standards, band importance for NU6 words from the ANSI SII are shown scaled to match the range of importance values obtained here for the model that included only main effects. For the model that included interactions, values along the diagonal show the importance of each band by itself and values off the diagonal show the importance of pairs of bands, with values mirrored above and below the diagonal.

The right panel of Fig. 1 shows band importance and pairwise interactions across bands for the model that included main and interaction effects, indicated by color. Weights for individual bands are shown on the diagonal, and the remainder of cells reflect pairwise interactions between bands. For interactions between bands, positive coefficients indicate synergy, i.e., hearing that combination of bands had a superadditive effect on recognition accuracy, with better performance than expected based on the information provided by each band on its own. Negative coefficients indicate redundancy in speech cues conveyed by pairs of bands, and coefficients near zero indicate independence. The largest positive importance of band pairs shown in this plot corresponds to interactions between bands around 1 and 2.5 kHz, where the weight of the pair exceeds the summed weights of the individual bands. Pairs of bands that are proximal to one another tend to be redundant, with apparent clusters of redundancy for pairs of bands between 500 Hz and 1 kHz and bands > 3 kHz. Band importance (values on the diagonal) was highest for bands around 1 kHz and lowest for bands at the upper and lower ends of the spectrum. In addition to examining weights for total phonemes correct in words, we also examined weights for each phoneme separately, as described in the Appendix.

Random intercepts varied across participants, with estimated group level standard deviations of σ = 0.17 and σ = 0.18 in log-odds for the main effect model and the model with interaction terms, respectively. Estimated phoneme recognition accuracy ranged between 55% and 72% across participants, with a mean of 65% (43% accuracy for whole words). Both models account for variability in the assignment of band combinations to participants via fixed effects that are shared across participants, so this variability in accuracy across participants likely reflects factors that are intrinsic to participants rather than task design. Context effects, quantified by estimating intraclass correlation, were nearly identical across models (ρ = 0.36 for the main effect model and ρ = 0.35 for the model with interaction terms). These values are similar to the value reported by Bosen (2024) (ρ = 0.35) for MSTB CNC words in individuals with cochlear implants.

Table I shows the comparison of goodness of fit across models. The model which included both main effects and interaction terms yielded a better model fit (greater ELPD), despite the large number of added parameters. The effective number of parameters in each model was estimated to be smaller than the true number,2 indicating that both models were well specified for the data.

TABLE I.

Comparison of goodness-of-fit across models.

Model                ELPD_LOO(a)   SE(b)   p_LOO(c)   p_LOO SE   ΔELPD(d)   ΔELPD SE
Main effects only    −52 200       98      60         0.2        −531       38
With interactions    −51 668       104     265        1.1        —          —

(a) Expected log pointwise predictive density (ELPD), estimated with leave-one-out cross-validation (LOO).

(b) Standard error (SE) of the associated metric.

(c) Estimated effective number of parameters in the model (p_LOO).

(d) Difference in ELPD between models.

Figure 2 compares band importance across models. This figure replicates the importance function from the model with only main effects (left panel of Fig. 1) for comparison to the main effect of each band when interaction terms are included (i.e., the diagonal values in the right panel of Fig. 1). As shown, band importance decreased for almost all bands when interaction terms were included. Band importance was negative for the 100–200 Hz band and near zero or negative for bands above 3000 Hz in the model with interaction effects, whereas importance for these bands was positive in the model that included only main effects. These bands had synergistic interactions with other bands, which accounts for the difference between models. Importance for individual bands was moderately correlated across models (r = 0.57). The fact that this correlation is not close to 1 indicates that the importance of interactions between bands can be misattributed to the independent importance of either band when such interactions are not considered.

FIG. 2.

Band importance from the model that only included main effects (in black) compared to band importance from the model that also included interaction terms (in color).

Figure 3 shows the correlation of envelope level between frequency bands (left panel), along with the relationship between importance of band pairs and two predictors: envelope correlations and spectral distance between band center frequencies. In agreement with previous work (Bosen et al., 2024; Ueda and Nakajima, 2017), frequency bands clustered into regions with highly correlated envelopes, with clusters evident for frequencies below 500 Hz, between 500 and 2000 Hz, and above 2000 Hz. These clusters appear qualitatively similar to the clusters of redundant band pairs in Fig. 1. To quantify this observation, the importance of band pairs was regressed against envelope correlations. As shown in the top right panel of Fig. 3, there appears to be a nonlinear relationship between importance of band pairs and envelope correlation. The importance of band pairs decreased with increasing acoustic envelope correlation for bands with correlations ≳0.7, whereas correlations below this value were not strongly associated with importance. A post hoc decision was made to fit an exponential function to the data. The bottom right panel of Fig. 3 shows a similar trend between importance of band pairs and spectral distance between bands (in octaves). Importance increased with increasing spectral distance for bands closer than ~1 octave to one another, whereas importance was not strongly associated with distance for bands with center frequencies more than 1 octave apart. Band pairs with the highest importance (i.e., synergistic bands) tended to have moderate envelope correlations and center frequencies around 1–2 octaves apart. Comparing goodness of fit using each variable to predict importance of band pairs indicated that spectral distance was a better predictor of band pair importance (ΔELPD = −42.6, SE = 10).
Envelope correlation and spectral distance were highly correlated (r = −0.84), so even though spectral distance was a better predictor, they both have similar overall ability to predict which band pairs are redundant.
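A post hoc exponential fit of this kind can be sketched as follows. This is a Python/SciPy illustration on synthetic band-pair data; the functional form, parameter values, and noise level are assumptions for demonstration, not the fit reported here.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_curve(x, a, b, c):
    """General exponential form for a post hoc curve fit."""
    return a * np.exp(b * x) + c

# Synthetic band-pair data: importance falls off sharply at high envelope
# correlation, loosely mimicking the trend in Fig. 3 (all values made up).
rng = np.random.default_rng(1)
env_corr = rng.uniform(0.0, 1.0, 210)
importance = exp_curve(env_corr, -0.02, 4.0, 0.05)
importance = importance + rng.normal(0.0, 0.02, 210)

params, _ = curve_fit(exp_curve, env_corr, importance,
                      p0=(-0.02, 4.0, 0.05), maxfev=10000)
a_fit, b_fit, c_fit = params
```

The negative amplitude and positive rate recover the assumed shape: near-flat importance at low correlations, dropping steeply as correlation approaches 1.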

FIG. 3.

Correlation of acoustic envelope across frequency bands (left) and comparison of envelope correlation and spectral distance with the importance of band pairs (right). Lines show exponential curves fit to the data.

Figure 4 shows the reliability of importance values estimated with Monte-Carlo resampling of the data. Quantified as the mean correlation between pairs of estimates, reliability of the resampled data was high for 41 000 points (r = 0.81). Reliability declined for 20 000 or 10 000 points (r = 0.67 and r = 0.61, respectively), and fell rapidly for 5000 or 2500 points (r = 0.48 and r = 0.26, respectively).

FIG. 4.

Reliability of importance estimates as a function of number of samples in resampled data. Points show the estimated reliability from one pair of resampled data, and the line shows average reliability across all runs for each simulated number of trials.

IV. DISCUSSION

The results of this work demonstrate that pairwise interactions between bands make substantial contributions to speech recognition. Including pairwise interactions reduced the importance of each band by itself, suggesting that a substantial portion of previously reported band importance is attributable to the mean interaction across bands. Band pairs tended to convey redundant information when they were close in frequency and had correlated envelopes. Our results are in agreement with and expand upon previous findings.

As expected, the model which included only main effects replicated previously established trends (Fig. 1, left), in which importance was highest around 1–2 kHz and was lowest at the high and low ends of the spectrum (ANSI, 1997, 1969). The microstructure of the importance function differed from the ANSI standard, which is likely due to differences in speech stimuli and experimental paradigm (Apoux and Healy, 2012; Healy et al., 2013).

Including pairwise interaction terms in the model provided better prediction of word recognition accuracy (Table I). Band pairs had both positive and negative importance (Fig. 1, right), indicating that some pairs had synergistic interactions and others conveyed redundant information. In particular, the pairs with the highest importance were between bands around 1 and 2.5 kHz. Frequencies in these bands convey vowel formants (Peterson and Barney, 1952) and vowel-consonant coarticulation cues (Strange, 1989), which are most informative in conjunction (e.g., together, the first and second vowel formants convey vowel identity). In addition, some individual bands that had near-zero importance, such as the band around 3400 Hz, have positive interactions with other bands. Thus, while these frequencies may not contribute to speech recognition in isolation, the information that they contain combines synergistically with information from other frequency regions to support recognition (Warren et al., 2005).

The decrease in importance for individual bands when interaction terms were included (Fig. 2) indicates that some proportion of band importance in previous studies can be attributed to interaction between bands. The extent to which band importance dropped varied across bands, indicating that the presence of within-band and between-band speech cues varies across the spectrum. However, the overall shape of the importance function was preserved with the inclusion of interaction terms, with highest importance around 1 kHz and lower importance at the upper and lower ends of the spectrum. Thus, while including interaction terms better accounts for speech recognition, currently established band importance functions that do not include interactions are still a valid approximation.

Correlations between the acoustic envelope of frequency bands (Fig. 3, left panel) appear to be qualitatively similar to correlations observed for keywords embedded in short sentences (Bosen et al., 2024), suggesting that this structure of cross-frequency envelope correlations generalizes across speech stimuli [see also Ueda and Nakajima (2017)]. These correlations demonstrate that cues carried in the speech signal are clustered into bands, with the most obvious divisions between bands around 500, 2000, and 4000 Hz. Additionally, the acoustic envelopes of proximal frequency bands were correlated with one another across the spectrum. Comparing the importance of pairwise band interactions to cross-frequency envelope correlations and spectral distance (Fig. 3, right panels) supported our hypothesis that correlated and proximal frequency bands tend to convey redundant information and thus have lower importance [see also Müsch and Buus (2001) and Warren et al. (2005)]. This observation also accounts for why filtering out several narrow frequency bands dispersed throughout the spectrum has less negative impact on speech recognition than filtering out a single wide band of the spectrum (Humes and Kidd, 2016), as redundancy between proximal frequencies would compensate for narrow gaps in the spectrum. The largest synergistic interactions between bands were between bands around 1 and 2.5 kHz, but there are not obvious features in either envelope correlation or spectral distance that offer a clear reason for the location of these synergistic interactions. This disparity between synergy and acoustic features suggests that synergies may not arise solely from acoustic distinctiveness.

Monte-Carlo resampling of our data indicates that our results had a reliability of around r = 0.8 (Fig. 4). As expected, reliability decreased with a decreasing number of experimental trials, with slight declines for N = 20 000 and 10 000 and low reliability (r < 0.5) at smaller sample sizes. Participant responses were phonetically transcribed by experimenters, so errors in transcription may place an upper bound on reliability. For data that were transcribed by two experimenters, agreement was high but not perfect, which indicates that there is likely some limitation in reliability that arises from our experimental procedure. In addition, our estimates of reliability are probably conservative because we used completely random sampling to simplify the resampling procedure. Completely random sampling does not control for individual differences in accuracy or provide balanced sampling of each band and pair of bands. Selecting stimuli to maximize the information gained [e.g., Shen and Kern (2018)] would likely increase reliability relative to completely random sampling. Given the rate of data collection observed here (approximately 1000 trials per hour), assessing individual differences in band importance with these methods would likely require numerous test sessions, even with marked improvements in efficiency. However, this amount of data collection is feasible for estimating band pair importance at a group level. The feasibility and reliability of this approach support using it to measure the importance of band pairs for stimuli other than the words in quiet used here.
Specifically, previous studies have shown that band importance varies across stimuli (DePaolis et al., 1996; Healy et al., 2013), presentation level (Calandruccio et al., 2016; Studebaker and Sherbecoe, 2002), target talker (Buss et al., 2025), and listening conditions (Bosen et al., 2024; Jorgensen, 2025), so it seems plausible that our method could be used to identify differences in importance of band pairs across conditions as well. Given that the number of pairwise interactions scales quadratically with the number of stimulus bands, it may be feasible to obtain more precise estimates in fewer trials if the number of bands is reduced. For example, our use of the 21 bands in the critical band procedure of the ANSI SII standard yielded 210 interaction terms, whereas reducing the number of bands to six one-octave-wide bands would only require 15 interaction terms. However, reducing the granularity of the frequency bands comes with the trade-off of potentially averaging out some of the fine structure evident in Fig. 1 and would alter the synergy and redundancy of band pairs.
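The Monte-Carlo reliability analysis described above can be sketched as a split-half procedure: randomly split trials into halves, estimate per-band accuracy in each half, and correlate the two importance profiles. The sketch below is a simplified illustration, not the study's actual resampling code; the data layout, function name, and the synthetic accuracy model are all assumptions.

```python
import numpy as np

def split_half_reliability(band_ids, correct, n_bands=21, n_resamples=50, seed=0):
    """Monte-Carlo split-half reliability of a band importance profile.

    band_ids: (n_trials, k) array of band indices presented on each trial.
    correct:  (n_trials,) array of 0/1 word-level scores.
    Trials are repeatedly split at random into halves; per-band accuracy
    (mean score over trials containing that band) is computed in each half
    and the two profiles are correlated.
    """
    rng = np.random.default_rng(seed)
    n_trials = len(correct)
    rs = []
    for _ in range(n_resamples):
        perm = rng.permutation(n_trials)
        halves = (perm[: n_trials // 2], perm[n_trials // 2:])
        profiles = []
        for idx in halves:
            acc = [correct[idx][np.any(band_ids[idx] == b, axis=1)].mean()
                   for b in range(n_bands)]
            profiles.append(acc)
        rs.append(np.corrcoef(profiles[0], profiles[1])[0, 1])
    return float(np.mean(rs))

# Synthetic demonstration: each trial presents 5 of 21 bands, and accuracy
# rises with the mean band index, so the recovered importance profile
# should be consistent across random halves.
rng = np.random.default_rng(1)
band_ids = np.argsort(rng.random((20000, 21)), axis=1)[:, :5]  # 5 distinct bands per trial
p = 0.2 + 0.6 * band_ids.mean(axis=1) / 20.0
correct = rng.binomial(1, p)
r_split = split_half_reliability(band_ids, correct)
```

With 20 000 trials and a strong simulated importance gradient, the split-half correlation is high; shrinking the trial count degrades it, mirroring the pattern in Fig. 4.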

The substantial effects of cross-frequency interactions on speech recognition accuracy indicate that the ability of the SII to predict speech recognition outcomes could be improved by incorporating such interactions. In patients with hearing loss, the configuration of the audiogram and aided audibility would affect the extent to which cross-frequency speech cues are available to the listener, so additional work is needed to test whether incorporating interactions into SII calculations substantially improves our ability to predict speech recognition outcomes in these patients. The presence of redundancies between proximal frequencies also suggests an opportunity to selectively amplify only some portions of the speech signal, which could provide audibility of the speech cues that support recognition while keeping overall loudness below maximum comfortable levels.
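One hypothetical way interactions might enter the index is to add a pairwise term to the standard weighted sum of band audibilities. The sketch below is not part of ANSI S3.5 and is not the authors' proposal in code form; the function name and the form of the interaction term are illustrative assumptions only.

```python
import numpy as np

def sii_with_interactions(audibility, importance, pair_importance):
    """Hypothetical SII-like index with pairwise interaction terms.

    audibility:      (n_bands,) band audibility values in [0, 1].
    importance:      (n_bands,) main-effect band importance weights.
    pair_importance: (n_bands, n_bands) interaction weights (positive =
                     synergy, negative = redundancy); each pair is counted
                     once via the upper triangle.
    """
    main = float(np.dot(audibility, importance))
    pair_audibility = np.outer(audibility, audibility)
    pairs = float(np.sum(np.triu(pair_audibility * pair_importance, k=1)))
    return main + pairs

# With all interaction weights at zero, the index reduces to the
# conventional audibility-weighted sum of band importances.
a = np.array([1.0, 0.5, 0.0])
w = np.array([0.2, 0.5, 0.3])
baseline = sii_with_interactions(a, w, np.zeros((3, 3)))
```

Weighting each interaction by the product of the two bands' audibilities is one simple choice; it makes an interaction contribute only when both bands are at least partially audible.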

An important limitation of this study is that it used filtering to estimate frequency band importance, which is an acoustic manipulation that does not have an obvious real-world correlate. Alternative methods which examine the susceptibility of information in frequency bands to masking have been proposed (Buss and Bosen, 2021; Doherty and Turner, 1996), but filtering and masking conditions can result in different estimates of band importance (Yoho et al., 2018). Given the importance of cross-frequency interactions observed here and in previous studies, it seems worthwhile to also estimate cross-frequency interactions using noise masking techniques.

A second limitation is that this work was conducted in young adults with normal hearing. While these listeners are often treated as optimal listeners, there could nonetheless be individual differences within this group that affect speech recognition [e.g., Bosen and Barry (2020) and Bosen and Doria (2024)]. The impact of cross-frequency interactions may also depend on listener factors including development, aging, and hearing loss. The importance of speech cues in some frequencies undergoes developmental changes in children (Bosen et al., 2025), so it seems plausible that the importance of cross-frequency cues would also change as children develop. Functional changes in the auditory pathway due to advancing age and/or long periods of hearing loss may also affect the encoding and subsequent use of specific speech cues. To our knowledge, changes in band importance have not been examined across the lifespan, and our results may not generalize outside of the one decade of life that our participants spanned.

V. CONCLUSION

Pairs of frequency bands in speech may convey redundant or synergistic speech cues. Previously established methods of measuring band importance functions can be extended to estimate the importance of these frequency band pairs for speech recognition. Proximal frequencies tend to carry somewhat redundant information, whereas frequencies separated by one to two octaves tend to have high-importance, synergistic interactions. Quantifying the importance of cross-frequency interactions in group data is feasible using the approach described here and has the potential to improve our ability to predict speech recognition accuracy.

ACKNOWLEDGMENTS

Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award No. P20GM109023.

APPENDIX

In addition to the primary analysis based on word recognition, the importance of each frequency band and pairwise interactions between bands were estimated for each phoneme separately. A generalized linear mixed effects model was used to estimate the main effect and pairwise interaction effects as described in the main text, except that accuracy for each phoneme was scored as a binary correct or incorrect response. To obtain phoneme accuracy, response transcriptions were compared to target transcriptions. If the initial or final phoneme of the response matched the corresponding phoneme of the target, then that consonant was marked as correct; otherwise, it was marked as incorrect. For the vowel, if the vowel in the target matched any phoneme in the response, it was marked as correct. This scoring procedure was used to account for edge cases where the response lacked an initial or final consonant (e.g., “new,” “ooze”) and where the response was more than three phonemes long (e.g., “color,” “liked”).
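The scoring rules above can be sketched as a small function. The representation of transcriptions as lists of phoneme symbols and the function name are assumptions for illustration; the rules themselves follow the text.

```python
def score_phonemes(target, response):
    """Score a three-phoneme (C-V-C) target against a response transcription.

    target and response are lists of phoneme symbols. Initial and final
    consonants are credited only when the response's first/last phoneme
    matches the target's; the vowel is credited if it occurs anywhere in
    the response, which handles responses missing a consonant or longer
    than three phonemes.
    """
    return {
        "initial": bool(response) and response[0] == target[0],
        "vowel": target[1] in response,
        "final": bool(response) and response[-1] == target[-1],
    }
```

For example, responding "noon" /n u n/ to target "moon" /m u n/ credits the vowel and final consonant but not the initial consonant.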

FIG. 5.

Band importance for each phoneme in the MSTB CNC words when only the main effect of each band is included (top) and when pairwise interaction terms are also included (bottom). Figure format and interpretation match Fig. 1.

Figure 5 shows band importance and pairwise interactions across bands for each phoneme in the same format as Fig. 1. Band importance differed between consonants and vowels, with consonants having greater importance below ~500 Hz and above ~2 kHz and vowels having higher importance between 500 Hz and 2 kHz. When pairwise interactions are modeled, positive interactions between bands around 1 and 2.5 kHz and negative interactions between bands above 3 kHz are evident for all phonemes, matching the trends shown in Fig. 1. In contrast, greater redundancy between proximal frequency bands is evident for vowels than for consonants, particularly in the 500 Hz–2 kHz region.

Footnotes

Conflict of Interest

The authors have no conflicts of interest to disclose.

Ethics Approval

Informed consent was obtained from all participants. Data collection was approved by the Boys Town National Research Hospital Institutional Review Board (Protocol 17–05XP).

1

The 20 349 unique combinations in the five-band condition divided by 500 trials per participant yields 40.7 participants, which we rounded up to 41. 151 of the combinations were presented to two participants so that all participants completed the same number of trials.
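The counts in this footnote follow directly from the combinatorics of choosing 5 of the 21 critical bands; a quick check:

```python
from math import comb

# 21 critical bands taken 5 at a time gives the number of unique
# five-band stimuli in the five-band condition.
n_combos = comb(21, 5)                    # 20 349 unique combinations
n_participants = -(-n_combos // 500)      # ceiling of 20349 / 500 = 41
n_repeated = n_participants * 500 - n_combos  # combinations heard by two participants
```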

2

The main effect model had 1 intercept + 21 band effects + 1 context parameter + 41 random intercepts per participant = 64 parameters. The model with interactions had 1 intercept + 21 main effect parameters + 210 interaction parameters + 1 context parameter + 41 random intercepts per participant = 274 parameters.
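The parameter counts in this footnote can likewise be verified, with the 210 interaction terms arising as the number of unordered pairs of the 21 bands:

```python
from math import comb

n_bands = 21
# Main-effect model: intercept + band effects + context + 41 random intercepts.
main_effect_params = 1 + n_bands + 1 + 41
# Interaction model adds one parameter per unordered pair of bands.
interaction_params = main_effect_params + comb(n_bands, 2)
```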

DATA AVAILABILITY

All data and analysis scripts used to produce the figures and statistics reported in this manuscript are available through the Open Science Framework at https://doi.org/10.17605/OSF.IO/VUB3H.

References

  1. ANSI (1969). American National Standard Methods for the Calculation of the Articulation Index (American National Standards Institute, New York).
  2. ANSI (1997). ANSI S3.5, Methods for Calculation of the Speech Intelligibility Index (American National Standards Institute, New York).
  3. Apoux F, and Healy EW (2012). “Use of a compound approach to derive auditory-filter-wide frequency-importance functions for vowels and consonants,” J. Acoust. Soc. Am. 132(2), 1078–1087.
  4. Bosen AK (2024). “Characterizing correlations in partial credit speech recognition scoring with beta-binomial distributions,” JASA Express Lett. 4(2), 025202.
  5. Bosen AK, and Barry MF (2020). “Serial recall predicts vocoded sentence recognition across spectral resolutions,” J. Speech, Lang., Hear. Res. 63(4), 1282–1298.
  6. Bosen AK, and Chatterjee M (2016). “Band importance functions of listeners with cochlear implants using clinical maps,” J. Acoust. Soc. Am. 140(5), 3718–3727.
  7. Bosen AK, and Doria GM (2024). “Identifying links between latent memory and speech recognition factors,” Ear Hear. 45(2), 351–369.
  8. Bosen AK, Frenette A, Spratford M, Lewis D, Walker EA, and McCreery RW (2025). “Developmental changes in speech frequency weighting in children,” Ear Hear. (published online).
  9. Bosen AK, Wasiuk PA, Calandruccio L, and Buss E (2024). “Frequency importance for sentence recognition in co-located noise, co-located speech, and spatially separated speech,” J. Acoust. Soc. Am. 156(5), 3275–3284.
  10. Bürkner PC (2017). “brms: An R package for Bayesian multilevel models using Stan,” J. Stat. Software 80(1), 1–28.
  11. Buss E, and Bosen AK (2021). “Band importance for speech-in-speech recognition,” JASA Express Lett. 1(8), 084402.
  12. Buss E, Hall SCF, Garcia S, Capretta GS, and Calandruccio L (2025). “English band importance functions in competing speech for Spanish-speaking second-language learners of English,” J. Acoust. Soc. Am. 158(2), 1367–1376.
  13. Calandruccio L, Buss E, and Doherty KA (2016). “The effect of presentation level on spectral weights for sentences,” J. Acoust. Soc. Am. 139(1), 466–471.
  14. Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker MA, Guo J, Li P, and Riddell A (2017). “Stan: A probabilistic programming language,” J. Stat. Software 76(1), 1–32.
  15. DePaolis RA, Janota CP, and Frank T (1996). “Frequency importance functions for words, sentences, and continuous discourse,” J. Speech, Lang., Hear. Res. 39(4), 714–723.
  16. Doherty KA, and Turner CW (1996). “Use of a correlational method to estimate a listener’s weighting function for speech,” J. Acoust. Soc. Am. 100(6), 3769–3773.
  17. Glasberg BR, and Moore BCJ (1990). “Derivation of auditory filter shapes from notched-noise data,” Hear. Res. 47(1–2), 103–138.
  18. Gonthier C (2022). “An easy way to improve scoring of memory span tasks: The edit distance, beyond ‘correct recall in the correct serial position,’” Behav. Res. 55(4), 2021–2036.
  19. Healy EW, Yoho SE, and Apoux F (2013). “Band importance for sentences and words reexamined,” J. Acoust. Soc. Am. 133(1), 463–473.
  20. Humes LE, and Kidd GR (2016). “Speech recognition for multiple bands: Implications for the Speech Intelligibility Index,” J. Acoust. Soc. Am. 140(3), 2019–2026.
  21. Jorgensen E (2025). “Frequency importance functions in real-world noise for listeners with typical hearing and hearing loss,” J. Speech, Lang., Hear. Res. 68, 4961–4977.
  22. Kasturi K, Loizou PC, Dorman M, and Spahr T (2002). “The intelligibility of speech with ‘holes’ in the spectrum,” J. Acoust. Soc. Am. 112(3), 1102–1111.
  23. Kryter KD (1960). “Speech bandwidth compression through spectrum selection,” J. Acoust. Soc. Am. 32(5), 547–556.
  24. MSTB (2011). Minimum Speech Test Battery (MSTB) for Adult Cochlear Implant Users: New MSTB User Manual.
  25. Müsch H, and Buus S (2001). “Using statistical decision theory to predict speech intelligibility. II. Measurement and prediction of consonant-discrimination performance,” J. Acoust. Soc. Am. 109(6), 2910–2920.
  26. Peterson GE, and Barney HL (1952). “Control methods used in a study of the vowels,” J. Acoust. Soc. Am. 24(2), 175–184.
  27. Pronk T, Molenaar D, Wiers RW, and Murre J (2022). “Methods to split cognitive task data for estimating split-half reliability: A comprehensive review and systematic assessment,” Psychon. Bull. Rev. 29, 44–54.
  28. R Core Team (2024). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, https://www.R-project.org/.
  29. Rogers AJ, Bosen A, McCreery R, and Buss E (2025). https://osf.io/nbyqu/.
  30. Shen Y, and Kern AB (2018). “An analysis of individual differences in recognizing monosyllabic words under the speech intelligibility index framework,” Trends Hear. 22, 233121651876177.
  31. Sivula T, Magnusson M, Matamoros AA, and Vehtari A (2020). “Uncertainty in Bayesian leave-one-out cross-validation based model comparison.”
  32. Strange W (1989). “Dynamic specification of coarticulated vowels spoken in sentence context,” J. Acoust. Soc. Am. 85(5), 2135–2153.
  33. Studebaker GA, and Sherbecoe RL (2002). “Intensity-importance functions for bandlimited monosyllabic words,” J. Acoust. Soc. Am. 111(3), 1422–1436.
  34. Ueda K, and Nakajima Y (2017). “An acoustic key to eight languages/dialects: Factor analyses of critical-band-filtered speech,” Sci. Rep. 7(1), 42468.
  35. Vehtari A, Gelman A, and Gabry J (2017). “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC,” Stat. Comput. 27(5), 1413–1432.
  36. Warren RM, Bashford JA Jr., and Lenz PW (2005). “Intelligibilities of 1-octave rectangular bands spanning the speech spectrum when heard separately and paired,” J. Acoust. Soc. Am. 118, 3261–3266.
  37. Yoho SE, Apoux F, and Healy EW (2018). “The noise susceptibility of various speech bands,” J. Acoust. Soc. Am. 143(4), 2527–2534.
  38. Yoho SE, and Bosen AK (2019). “Individualized frequency importance functions for listeners with sensorineural hearing loss,” J. Acoust. Soc. Am. 145(2), 822–830.
