Author manuscript; available in PMC 2024 Mar 1.
Published in final edited form as: Hear Res. 2023 Jan 14;429:108697. doi: 10.1016/j.heares.2023.108697

Quantitative models of auditory cortical processing

Srivatsun Sadagopan 1,2,3,4,5,*, Manaswini Kar 1,2,3, Satyabrata Parida 1,2
PMCID: PMC9928778  NIHMSID: NIHMS1867983  PMID: 36696724

Abstract

To generate insight from experimental data, it is critical to understand the inter-relationships between individual data points and place them in context within a structured framework. Quantitative modeling can provide the scaffolding for such an endeavor. Our main objective in this review is to provide a primer on the range of quantitative tools available to experimental auditory neuroscientists. Quantitative modeling is advantageous because it can provide a compact summary of observed data, make underlying assumptions explicit, and generate predictions for future experiments. Quantitative models may be developed to characterize or fit observed data, to test theories of how a task may be solved by neural circuits, to determine how observed biophysical details might contribute to measured activity patterns, or to predict how an experimental manipulation would affect neural activity. In complexity, quantitative models can range from those that are highly biophysically realistic and that include detailed simulations at the level of individual synapses, to those that use abstract and simplified neuron models to simulate entire networks. Here, we survey the landscape of recently developed models of auditory cortical processing, highlighting a small selection of models to demonstrate how they help generate insight into the mechanisms of auditory processing. We discuss examples ranging from models that use details of synaptic properties to explain the temporal pattern of cortical responses to those that use modern deep neural networks to gain insight into human fMRI data. We conclude by discussing a biologically realistic and interpretable model that our laboratory has developed to explore aspects of vocalization categorization in the auditory pathway.

Keywords: Computation, modeling, quantitative analysis, neural networks, auditory cortex, vocalizations

1. Introduction

A critical requirement for extracting insight from experimental data is to place individual experimental observations and their inter-relationships in context within a well-defined framework. In this review, we describe how quantitative modeling can provide such a framework for gaining an understanding of auditory cortical processing from an ever-increasing mountain of electrophysiological, imaging, and behavioral data. The field of quantitative modeling of the auditory system spans the entirety of the auditory pathway and a wide range of auditory processes, a vast scope that cannot be covered in a single review. Therefore, in this review, we focus on quantitative models of auditory cortical processing at the level of single neurons and neural circuits that have been developed in the past decade or so. Our main objective is to highlight the insights about auditory cortical processing that have been generated by a small sample of quantitative models of different aspects of cortical activity at different levels of abstraction. In addition to informing experimental neuroscientists about contemporary computational frameworks and tools that are available to understand auditory cortical activity, our hope is to also encourage the widespread adoption of quantitative models across different levels of analysis so that these may be synthesized in the future to obtain an integrated understanding of auditory cortical function.

1.1. Purpose of quantitative models

There are three main reasons for developing quantitative models of neural processing. First, as stated above, by making the relationships between experimental observations explicit, quantitative models can provide the broader scaffolding needed to gain insight from experimental observations. Second, quantitative models can provide a simple or compact explanation for observed experimental phenomena, thereby summarizing large or complex data sets. In doing so, underlying assumptions about the system and the relative importance of various observed variables are also made explicit. Third, once a validated model that can explain existing data is developed, it can serve as a hypothesis generator for future experiments. For example, perturbations to the system can be made in-silico, which can lead to novel predictions that can be verified experimentally. Most importantly, a validated model can be easily shared across the community, which can in turn enable its use in other experiments and contexts. But these models cannot be ‘one size fits all’; as we detail below, different types of models, with different motivations and levels of abstraction, may be required to explain the available data or the observed phenomenon. In this review, we survey a diverse set of problems that have been addressed using quantitative models, including understanding temporal-to-rate transformations in auditory cortex, the biophysical bases of cortical states, explaining the effects of behavioral state on auditory cortical activity, estimating the receptive fields of auditory cortical neurons, evaluating the impact of network activity on auditory neurons, and suggesting biologically feasible circuits to perform auditory categorization (summarized in Table 1).

Table 1:

Schema for model classification and illustrative examples of quantitative models at different scales in auditory cortex.

| Model type | Scale | Example purposes | References |
| --- | --- | --- | --- |
| Mechanistic | Single neuron | E/I dynamics for ACtx ‘up’ and ‘down’ states | DeWeese & Zador, 2006 |
| | | E/I dynamics for temporal-to-rate code transformation in A1 | Bendor, 2015; Gao & Wehr, 2015 |
| | Circuit | Interval timing detection | Buonomano & Merzenich, 1995; Motanis et al., 2018 |
| | | Determine E/I operational regime of ACtx circuits | Kato et al., 2017 |
| Descriptive | Single neuron | STRF estimation/response prediction with input transformations | Gill et al., 2006; David et al., 2009 |
| | | STRF estimation/response prediction with contextual modulations | Ahrens et al., 2008a; Williamson et al., 2016 |
| | Network/population | STRF estimation with network interactions | Harper et al., 2016; Rahman et al., 2019 |
| | | STRF estimation with convolutional neural networks | Keshishian et al., 2020; Pennington & David, 2022 |
| Normative | Single neuron | Contrast gain control by rescaling output nonlinearity | Rabinowitz et al., 2011; Mesgarani et al., 2014 |
| | | Categorization in noise using adaptive mechanisms | Parida et al., 2022 |
| | Multi-layer | Generative model for sound representation using mid-level features | Młynarski & McDermott, 2018 |
| | | Categorization of complex sounds using informative and non-redundant features | Liu et al., 2019 |

Abbreviations: A1, primary auditory cortex; ACtx, auditory cortex.

1.2. Types of quantitative models

One influential framework for understanding information processing systems was suggested by Marr (Marr, 1982), who proposed three essential ‘levels’ of inquiry. These are 1) the computational goal of the system, 2) the algorithms that may be used to achieve this goal, and 3) how these algorithms may be implemented in hardware, i.e., in the neural circuit. This overarching philosophical framework suggests several levels at which models may be developed. For example, divisive normalization by population activity to preserve dynamic range (Carandini and Heeger, 2012) is an algorithmic-level model, whereas implementing this normalization using specific connectivity patterns between excitatory and inhibitory neurons (Seybold et al., 2015) is an implementation-level model.
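To make the algorithmic level concrete, divisive normalization is commonly written as a simple equation in which each neuron's driving input is divided by the summed activity of a normalization pool. One standard formulation (following Carandini and Heeger, 2012; the symbols here are generic, not drawn from a specific auditory model) is:

$$ R_i \;=\; \gamma \, \frac{D_i^{\,n}}{\sigma^{\,n} + \sum_j D_j^{\,n}} $$

where $D_i$ is the driving input to neuron $i$, the sum runs over the normalization pool, $\sigma$ is a semi-saturation constant, $n$ controls the steepness of the nonlinearity, and $\gamma$ scales the output.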

From a practical standpoint in terms of actual model implementation, quantitative models can be categorized into three broad classes (Wang et al., 2020). First, mechanistic models incorporate known anatomical and/or biophysical properties to build neuron models to explain experimental observations. For example, detailed auditory nerve models (Bruce et al., 2018; Zilany et al., 2014) that include cochlear nonlinearities, rate adaptation, and realistic discharge rate generators, can emulate recorded auditory nerve responses as well as explain some behavioral observations (Saremi et al., 2016). A compartmental model that includes both passive filtering in dendrites and a biologically realistic distribution of potassium channels can successfully recapitulate the sub-millisecond precision required to compute inter-aural time differences in the medial superior olive (Mathews et al., 2010). Second, descriptive models, which likely form the most prevalent model class, can be used to quantitatively characterize experimental data. For example, simple linear fitting essentially tests the model that the dependent variable varies linearly with the independent variable. Spectrotemporal receptive field (STRF) models of cortical neurons (Sharpee, 2013) attempt to explain a neuron’s responses to a given stimulus as a linear combination of the neuron’s responses to simple sound features (e.g., frequencies). Finally, normative models aim to explain why a certain computation may be functionally necessary at a particular stage of processing. For example, by starting with the principle that the auditory system needs to represent information efficiently at the input stage, and by recognizing that the system is likely biased to optimally represent some classes of natural sounds such as speech, Smith and Lewicki, 2006 were able to obtain tuning functions for theoretical input neurons that closely matched the tuning functions of recorded auditory nerve fibers. Some models of contrast gain control start with the principle that auditory neurons need to maintain their dynamic range for encoding stimuli in a variety of listening conditions (Rabinowitz et al., 2011; Willmore et al., 2014) and use dynamic adjustments of the input-output functions of model neurons to arrive at experimentally testable predictions of how firing rates of neurons would be affected in different conditions if this principle was indeed in operation. While some parallels between this practical model classification scheme and the more philosophical framework of Marr are evident (e.g., mechanistic model – implementation level, normative model – algorithmic level), for clarity, in this review we will classify models from the practical standpoint (i.e., as mechanistic, descriptive, or normative models; see Table 1 and Figure 1).

Figure 1: Schematic of model types and scales.


(A) Mechanistic models use biophysical details to explain experimentally observed phenomena. Examples include single-neuron models to explain temporal-to-rate firing rate transformations (left; based on Bendor, 2015), or circuit models to explain interval selectivity of cortical responses (right; adapted from Motanis et al., 2018). (B) Descriptive models are used to quantitatively characterize experimental data. These include single-neuron encoding models such as LN models (left) as well as multi-layer encoding models (right; adapted from Keshishian et al., 2020). (C) Normative models aim to explain why a computation may be necessary at a given stage of processing. For example, one proposed reason for why selectivity for some complex features is observed in auditory cortex is that detecting a set of non-redundant features is optimal for categorizing sounds (based on Liu et al., 2019). Details of all models are discussed in the main text.

Models at each of these levels can also explain data at different scales with corresponding degrees of abstraction. For example, a mechanistic model may focus on single neuron responses, including details such as the distribution of channels along the dendrite (Migliore and Shepherd, 2002) or the time course of synaptic depression (e.g., Abbott et al., 1997), or focus on the dynamics of neural circuits with some simplifying assumptions about biophysical details (e.g., Kuchibhotla et al., 2017). Alternatively, a descriptive model may focus on stimulus encoding by single neurons, or model multiple auditory processing stages consisting of thousands of neurons (for example, modern deep neural network [DNN] models) built using simple (more abstract) integrate-and-fire units. In some cases, such as the first mechanistic example, the motivation for developing the model may be to test the contributions of fine biophysical details to neuronal function. But for the large-scale descriptive example, the details of channel distribution may be irrelevant to the question at hand, and/or modeling that level of detail may be computationally intractable. Thus, choosing the level of abstraction that is appropriate for the scale of the model is a key consideration for gaining insight into the question at hand using a quantitative model. In this review, we will discuss quantitative models at different scales, from single neurons to multi-layered neural networks (see Table 1).

1.3. Quantitative models in the ascending auditory system

Like other sensory systems, the auditory system shows increasing processing complexity along the hierarchy, i.e., neurons in earlier stages respond to simple sensory features and neurons in later stages respond to more complex features and experience greater modulations by internal state and behavioral context (Fritz et al., 2003; Montes-Lourido et al., 2021; Saderi et al., 2021; Sharpee et al., 2011). Much more anatomical and electrophysiological data is available at the earlier stages of the auditory pathway. The combination of relatively simple neural tuning and abundance of experimental data has led to the development of a wide range of models of early auditory processing across levels of analysis. An exhaustive review of these studies is beyond the scope of this review, which focuses on models of cortical processing. However, because these models may be used as biologically realistic ‘front-ends’ to models of higher auditory processing stages, in this section, we will briefly highlight a few of these models.

1.3.1. Auditory nerve models

At a mechanistic level, detailed models of auditory nerve tuning matched to experimental observations in humans and animals have been developed. These models include some cochlear nonlinearities and properties of the ribbon synapse, and capture the heterogeneity of nerve fiber types (Bruce et al., 2018; Zilany et al., 2014). These mechanistic details are necessary to capture some aspects of auditory nerve responses that cannot be explained by simpler models. For example, the addition of synaptic depression (using power-law dynamics) at the ribbon synapse was necessary to capture the temporal dynamics of auditory nerve fiber responses (Zilany et al., 2009). These detailed auditory nerve models also attempt to recreate neural coding aspects such as synchrony capture, two-tone suppression, and level-dependent frequency tuning. With appropriate readout strategies, these responses have been used to predict human performance in tasks such as frequency, level, and amplitude-modulation discrimination (Colburn et al., 2003; Dau et al., 1998a, 1998b; Heinz et al., 2001). The output of these mechanistic models can be used as biophysically realistic front-ends to other models as well (Carney et al., 2015; Liu et al., 2019; Saddler et al., 2021).

At a descriptive level, auditory nerve fibers show spectrally well-defined excitatory tuning with limited suppressive sidebands (due to mechanical properties of the cochlea) and simple low-pass modulation tuning (Joris and Yin, 1998; Liberman, 1978). These tuning features are well-captured by models using systems identification approaches to estimate receptive fields (de Boer and de Jongh, 1998; Carney and Yin, 1988; Eggermont et al., 1983). In these models, auditory nerve fibers are conceptualized as filters with specific impulse responses that confer the fiber with its frequency tuning and temporal response properties. As a further abstraction, the shapes of these auditory nerve filters can be captured as a sinusoidal carrier wave modulated by a gamma function envelope, termed a ‘gammatone’ filter (Patterson et al., 1987). In an engineering sense, auditory nerve responses can thus be approximated as a gammatone filter bank, which is a common front-end used in other auditory processing models (for example, Rahman et al., 2020). Thus, in addition to explaining some auditory nerve response properties, these descriptive quantitative models of the auditory periphery have proved invaluable in the development of other auditory processing models.
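As a concrete illustration of this abstraction, the sketch below builds a simple gammatone filter bank: a sinusoidal carrier modulated by a gamma-function envelope. The function names, the ERB-scale bandwidth rule, and all parameter values are illustrative assumptions, not the parameterization of any specific published model.

```python
import numpy as np

def gammatone_ir(cf, fs=16000, dur=0.05, order=4, b=1.019):
    """Impulse response of a gammatone filter centered at cf (Hz):
    a gamma-function envelope multiplying a sinusoidal carrier.
    Bandwidth follows the Glasberg & Moore ERB scale (an assumption;
    published filter banks differ in their exact parameterization)."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37 * cf / 1000.0 + 1.0)   # equivalent rectangular bandwidth
    env = t ** (order - 1) * np.exp(-2 * np.pi * b * erb * t)
    ir = env * np.cos(2 * np.pi * cf * t)
    return ir / np.linalg.norm(ir)            # unit-energy normalization

def gammatone_filterbank(sound, cfs, fs=16000):
    """A crude cochlear front-end: one bandpass channel per center frequency."""
    return np.stack([np.convolve(sound, gammatone_ir(cf, fs), mode='same')
                     for cf in cfs])
```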

At a normative level, some elegant studies have asked why auditory nerve fibers show ‘gammatone’ tuning in the first place. Starting with the assumption that the goal of peripheral auditory encoding was the efficient representation of sounds, Smith and Lewicki, 2006 used a set of wavelet-like filters and a matching-pursuit algorithm to reconstruct incoming acoustic inputs with the least residual errors. Critically, the shapes of the filters and the times of the spikes were derived by training this model using natural stimuli such as speech sounds. Surprisingly, the shapes of the filters derived using this purely theoretical approach were strikingly similar to the filters observed in electrophysiological experiments. This model thus provided critical insight into the encoding strategies of the auditory system that underlie the aforementioned experimental observations.
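For readers unfamiliar with matching pursuit, the following is a minimal sketch of the greedy encoding step only (the kernel-shape adaptation that Smith and Lewicki also performed is omitted; the names and the fixed spike budget are our illustrative choices):

```python
import numpy as np

def matching_pursuit(signal, kernels, n_spikes=100):
    """Greedy matching pursuit: encode `signal` as a sparse set of
    (kernel, time, amplitude) 'spikes'. `kernels` is a list of
    unit-norm 1-D arrays. Didactic sketch only."""
    residual = signal.astype(float).copy()
    spikes = []
    for _ in range(n_spikes):
        # Cross-correlate the residual with every kernel.
        scores = [np.correlate(residual, k, mode='valid') for k in kernels]
        best_k = int(np.argmax([np.abs(s).max() for s in scores]))
        best_t = int(np.abs(scores[best_k]).argmax())
        amp = scores[best_k][best_t]          # inner product (kernels unit-norm)
        # Subtract the scaled kernel and record the spike.
        residual[best_t:best_t + len(kernels[best_k])] -= amp * kernels[best_k]
        spikes.append((best_k, best_t, amp))
    return spikes, residual
```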

1.3.2. Brainstem models

In contrast to the auditory nerve, neurons in the brainstem (including the inferior colliculus) show heterogeneous spectral (narrow and broad) and modulation tuning (e.g., multiple peaks or band-reject) and more complex interactions between inhibitory and excitatory inputs (Krishna and Semple, 2000; Langner and Schreiner, 1988). Specialized circuits in the medial and lateral superior olive show specific tuning for spatial location (Caird and Klinke, 1983; Grothe, 2003; Kandler and Gillespie, 2005). At a mechanistic level, recent models have demonstrated how the inclusion of anatomical details such as the distribution of potassium channels along dendrites of medial superior olive neurons is necessary for explaining coincidence detection with sub-millisecond precision (Mathews et al., 2010). Models of the cochlear nucleus and inferior colliculus have been used to explain neural selectivity for second-order stimulus features such as modulation rate (Dicke et al., 2007; Nelson and Carney, 2004). At a descriptive level, tuning for modulation rates has been implemented as modulation filtering: for example, as the additive response of a narrowly tuned excitatory subunit and a widely tuned inhibitory subunit tuned to the same carrier frequency (Nelson and Carney, 2004). With various readout strategies, these modulation filter bank-based models have been used to predict speech intelligibility in degraded listening conditions (Jørgensen and Dau, 2011; Scheidiger et al., 2018). Systems identification approaches have also been widely used to characterize neurons in the inferior colliculus and thalamus, with these models revealing the tuning of these neurons to more complex stimulus features (Andoni et al., 2007; Escabí and Schreiner, 2002; Miller et al., 2002). The auditory brainstem is the locus of one of the best-known normative models: the Jeffress model for the computation of azimuthal location from binaural inputs using delay lines and coincidence detectors (Jeffress, 1948).
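As an illustration of the Jeffress scheme, the toy model below implements an array of coincidence detectors, each reading the left-ear signal through a different internal delay; the detector whose delay compensates the acoustic interaural time difference (ITD) responds most strongly. This is a didactic sketch under many simplifying assumptions (no cochlear filtering or phase-locked spiking; circular shifts are benign here only because the test tone is periodic), with hypothetical names throughout:

```python
import numpy as np

def jeffress_itd(left, right, fs, max_itd=5e-4):
    """Toy Jeffress model: coincidence detectors tuned to different
    internal delays vote for the ITD. Coincidence is approximated as
    the product of the two inputs summed over time."""
    max_lag = int(max_itd * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    resp = [np.sum(np.roll(left, lag) * right) for lag in lags]
    return lags[int(np.argmax(resp))] / fs    # estimated ITD (s)

# A 500-Hz tone arriving 0.2 ms earlier at the left ear:
fs = 40000
t = np.arange(4000) / fs
left = np.sin(2 * np.pi * 500 * t)
right = np.sin(2 * np.pi * 500 * (t - 2e-4))
print(jeffress_itd(left, right, fs))  # 0.0002: internal delay matching the lead
```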

In summary, the auditory hierarchy (and that of other sensory systems) can be conceptualized as being composed of early encoding and later decoding stages. Pre-thalamic models may largely be characterized as encoding models, with their explicit aim being to explain empirically observed neural response properties or behavioral results that are closely tied to low-level stimulus representations. While detailed mechanistic models may be necessary to explain neural response properties at this level, abstractions or engineering constructs (such as filter banks) are also available that can be used as front-ends to post-thalamic decoding models. We acknowledge that this brief discussion has barely scratched the surface of the body of research on encoding models of the ascending auditory pathway. Our objective in this review is to highlight recent quantitative models of the auditory cortex, which use the features extracted from these pre-thalamic representations to accomplish specific behavioral goals.

2. Mechanistic models in the auditory cortex

2.1. Single-neuron models

Few quantitative models of the auditory cortex operate at a mechanistic level or incorporate biophysical details. These models are generally focused on explaining specific features of single auditory cortical neuron responses from in-vivo recordings. For example, one observation, particularly in anesthetized animals, is that the membrane potential of neurons is typically far from the spiking threshold (‘down’ state), with infrequent threshold crossings occurring due to large depolarizations (‘up’ states) (DeWeese and Zador, 2006; Harris et al., 2011; Hromádka et al., 2013). Conceptually, this distribution of membrane potentials is not consistent with many uncorrelated inputs; rather, it suggests that during ‘up’ states, the neuron is bombarded by a volley of spikes from a highly correlated presynaptic population. Comparing three variants of a simple model could establish the basic conditions that are necessary for such large depolarizations (DeWeese and Zador, 2006). In these variants, the membrane potential was modeled as a ‘random walk’ process with 1) purely excitatory inputs with a homogeneous response rate following Poisson statistics, 2) excitatory and inhibitory inputs with homogeneous firing rates, or 3) excitatory and inhibitory inputs with input firing rates varying at different time scales. This comparison showed that the observed data could only be matched by the third model, and only when the time scale of firing rate variations was rapid (consistent with highly correlated firing of the presynaptic population). This model illustrates how a quantitative model can both provide rigor to a conceptual explanation and reveal essential underlying assumptions (in this case, about the necessity of both excitation and inhibition to recapitulate the experimental observation).
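A minimal simulation of this idea is sketched below (with our illustrative parameter values, not those of DeWeese and Zador, 2006): the membrane potential is a random walk driven by Poisson excitatory and inhibitory inputs whose shared rates either stay constant or switch rapidly between quiet epochs and brief correlated 'volleys'. Only the latter regime produces infrequent, large depolarizations.

```python
import numpy as np

def membrane_random_walk(T=10.0, dt=1e-3, rate_fn=None, seed=0):
    """Random-walk membrane potential in the spirit of DeWeese & Zador
    (2006): each excitatory (inhibitory) input spike steps V up (down),
    with a slow leak back to rest. rate_fn(t) -> (r_e, r_i) gives the
    instantaneous input rates in Hz. All numbers are illustrative."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    V = np.zeros(n)
    for i in range(1, n):
        r_e, r_i = rate_fn(i * dt) if rate_fn else (500.0, 500.0)
        step = 0.5 * rng.poisson(r_e * dt) - 0.5 * rng.poisson(r_i * dt)
        V[i] = V[i - 1] + step - 0.05 * V[i - 1]   # input steps (mV) plus leak
    return V

# Variant 3: shared rates switching rapidly between quiet epochs and
# brief correlated 'volleys', producing rare, large depolarizations.
volleys = lambda t: (3000.0, 2000.0) if (t % 1.0) < 0.05 else (100.0, 100.0)
V = membrane_random_walk(rate_fn=volleys)
```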

Another phenomenon that has been explored using mechanistic models is the transformation of neural representations of amplitude-modulated stimuli from a temporal- to a rate-based representation by a population of cortical neurons. Two broad response types have been reported in recordings from the primary auditory cortex of unanesthetized animals (Bendor and Wang, 2007; Gao and Wehr, 2015; Lemus et al., 2009; Liang et al., 2002; Lu et al., 2001; Lu and Wang, 2004). Similar to sub-cortical responses, the spiking responses of one cortical response type (the ‘synchronized’ population) faithfully follow slow fluctuations of stimuli, but do not respond when the stimulus fluctuates too rapidly. The second cortical response type (the ‘non-synchronized’ population) does not respond to slow fluctuations but increases its firing rate in proportion to the rate of stimulus fluctuations for rapid fluctuations. Passive filtering by membrane properties alone cannot account for this transformation, as the membrane potential responses of cortical neurons (already filtered by membrane properties) can synchronize to high stimulus repetition rates (Gao and Wehr, 2015). Rather, a model that included push-pull amplification of membrane potential responses by inhibition that was slightly delayed with respect to excitation was critical in producing both the observed strength of synchronization for slow stimulus rates (Bendor, 2015; Fig. 1A, left) and the high asynchronous firing rates for fast stimulus rates (Gao and Wehr, 2015). A further extension to the Bendor, 2015 model demonstrated that incorporating short-term synaptic depression was necessary to explain additional features of cortical neuron responses (Lee et al., 2020). While the specific neuron types that provide the inhibition required by these models are unidentified, and the role of other intrinsic properties such as bursting (Liu and Wang, 2022) remains unexplored, these studies show how quantitative models at this level can help identify putative mechanisms and precise parameter ranges (for example, the E-I delay) that are required to explain observed neural responses.

2.2. Circuit models

Models at the mechanistic level can also be used to gain insight into how some critical computations can be implemented by simple neural circuits. For example, a key question in auditory processing is that of timing. Detecting precise temporal intervals in the range of tens to hundreds of milliseconds is important for the accurate recognition of complex sounds that contain spectrotemporal information over such wide timescales, such as speech (Aasland and Baum, 2003; Tallal, 1994), vocalizations (Montes-Lourido et al., 2021; Sadagopan and Wang, 2009), and music (Janata and Grafton, 2003). But how neural circuits can achieve selectivity for precise temporal intervals is yet to be understood. To address this question, Buonomano and Merzenich, 1995 built a simple model of integrate-and-fire units that incorporated known synaptic properties of cortical neurons such as paired-pulse facilitation (a form of short-term plasticity). The model consisted of two layers of excitatory and inhibitory neurons, approximating the microstructure of a cortical column (layers 4 and 3). The task was to discriminate the timing between two pulses delivered at the inputs. Delayed inhibition with respect to excitation, slow kinetics of the inhibitory post-synaptic potential, and the kinetics of paired-pulse facilitation altered the state of the circuit after the first pulse and caused neurons in the second layer (layer 3) to respond differentially to the second pulse. The pattern of activity in second-layer neurons could be used to decode the time interval between the two pulses with high accuracy. Critically, although the model was trained to discriminate the timing between only two pulses using a handful of inter-pulse intervals, the model generalized to more complex trains of pulses without further adjustments. In a follow-up study, Buonomano, 2000 demonstrated that a simple disynaptic circuit with the above properties was sufficient to achieve order- and interval-selectivity but required precise parameter tuning. However, a large circuit with greater heterogeneity could achieve generalization without the need for such tuning. Thus, these simple circuit models demonstrated how cortical neurons can achieve selectivity for order and interval on ethologically relevant timescales (tens to hundreds of milliseconds) using experimentally observed mechanisms such as short-term plasticity (Motanis et al., 2018; Fig. 1A, right), and without the need for components that may not be biologically realistic, such as temporal delay lines.
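To illustrate the kind of short-term plasticity these circuit models exploit, the sketch below implements a minimal paired-pulse facilitation rule in which release probability is transiently boosted by each presynaptic spike and decays back between spikes. The update rule, names, and constants are illustrative simplifications (in the spirit of Tsodyks-Markram-style dynamics), not the parameters used by Buonomano and Merzenich, 1995:

```python
import numpy as np

def psp_amplitudes(spike_times, U=0.2, tau_f=0.2):
    """Paired-pulse facilitation, minimal sketch: each presynaptic
    spike transiently boosts release probability u, which decays back
    to baseline U with time constant tau_f (seconds). Values are
    illustrative, not fits to cortical data. Returns the relative PSP
    amplitude for each spike in the train."""
    u, last_t, amps = U, None, []
    for t in spike_times:
        if last_t is not None:
            u = U + (u - U) * np.exp(-(t - last_t) / tau_f)  # facilitation decays
        amps.append(u)
        u += 0.5 * (1 - u)    # facilitation increment after the spike
        last_t = t
    return amps

# Two pulses 50 ms apart: the second PSP is larger (paired-pulse facilitation)
print(psp_amplitudes([0.0, 0.05]))   # [0.2, ~0.51]
```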

In a similar vein, simple circuit models incorporating the unique dynamics of short-term plasticity in different cortical neuron populations, such as parvalbumin (PV) and somatostatin (SOM) expressing inhibitory neurons, have been shown to recapitulate the varied temporal profiles of excitatory cortical neuron responses to pulse trains (Phillips et al., 2017a, 2017b). The use of a spiking model (rather than a firing rate-based model) that captured these differential synaptic dynamics, excitation-inhibition delays, and the relative balance of inhibition from the two inhibitory neuron types was a critical factor in the success of this approach in reconstructing cortical response types (Seay et al., 2020). But importantly, the model could also generate experimentally testable predictions regarding the spike latencies of excitatory neurons and how inactivating specific subpopulations of inhibitory neurons would alter excitatory neuron responses. During sequential tone presentation, the model predicted that inactivation of PV interneurons (but not SOM interneurons) during the first tone would decrease the firing rate of excitatory neurons to the second tone. This prediction was verified by performing optogenetic inactivation of the PV population during in-vivo recordings, further validating the model. This example illustrates how quantitative models can help discern between multiple mechanisms that might underlie an observed phenomenon.

Other circuit models have been used to explain or predict the effects of the targeted activation or inactivation of cortical cell types on the activity of excitatory neurons. For example, using optogenetic manipulations targeted to PV interneurons, Aizenberg et al., 2015 demonstrated that photoactivation of PV neurons increased the acuity of frequency discrimination, whereas photo-suppression impaired performance and degraded the specificity of learning. To understand these experimental observations, the authors built a mutually coupled excitatory-inhibitory firing rate model. The model also included inhibitory-excitatory synapse nonlinearities, particularly PV-generated synaptic depression. With only these simple considerations, the model could replicate the experimental observations, generating insight into the circuit wiring motifs that could underlie them. Seybold et al., 2015 also used a simple circuit model to explain the diverse effects of experimentally manipulating inhibition. To capture the effects of optogenetically manipulating PV and SOM neurons on excitatory neurons, they modeled a simple excitatory-inhibitory circuit in which the firing thresholds and the excitatory and inhibitory conductances were varied. The central insight the model provided was that changes to a neuron’s firing threshold, the strength of suppression onto it, and the interaction of these changes with the network could “mask” divisive scaling as subtractive and vice versa, providing a cautionary tale for experimentalists. Similarly, Phillips and Hasenstaub, 2016 used a simple circuit model to show how relatively small changes to parameters such as spontaneous rate and the strength of light manipulation could result in vastly different outcomes when modeling the effects of photoactivation or inactivation. These examples illustrate how, in addition to helping gain a deeper understanding of existing data, quantitative models can also guide experimental design.

Larger-scale network models incorporating more biological details have also provided insight into how activity in the cortex is maintained in a stable state and how external as well as internal factors can alter cortical state. For example, Kuchibhotla et al., 2017 used a network model consisting of four major cortical cell types (excitatory, PV, SOM, and vasoactive intestinal peptide [VIP] expressing interneurons) to demonstrate that neuromodulation of all three major types of inhibitory neurons played a role in active behavior, and that driving either inhibition or disinhibition alone could not reproduce the effects of active engagement on auditory cortical neurons. Larger-scale recurrent network models and state-space models have been used to gain a deeper understanding of neural dynamics observed using large-scale electrophysiological or optical neural recordings. One of the most powerful models of recurrent network dynamics to emerge in recent years across sensory domains is the inhibition-stabilized network (ISN). In a network where excitatory recurrent connections are strong enough to result in instability (runaway excitation) in the absence of any inhibition (as is the case in the cortex), increasing external excitatory input to inhibitory neurons can paradoxically lead to decreased activity in inhibitory neurons (Latham and Nirenberg, 2004; Tsodyks et al., 1997). Experimental support that the cortex operates in such an inhibition-stabilized regime was provided in the visual cortex by Ozeki et al., 2009, who used this framework to explain the phenomenon of surround suppression in primary visual cortex neurons. Using in-vivo intracellular recordings, Ozeki et al., 2009 demonstrated a transient increase in inhibitory conductance followed by a sustained decrease when inhibitory receptive field regions were excited, as predicted by an ISN model. In fact, only an ISN architecture could explain these experimental results. In the auditory cortex, Kato et al., 2017 used the analogous phenomenon of lateral inhibition and in-vivo intracellular recordings to show that the auditory cortex may also operate in an inhibition-stabilized regime. More recently, Aponte et al., 2021 used an auditory cortex model with an ISN architecture to explain the direction selectivity of A1 neurons to frequency-modulated sweeps. Operation of the model in the inhibition-stabilized regime was crucial in ensuring that the cortical circuitry amplified thalamic inputs in one sweep direction, whereas such amplification was prevented by leading inhibition in the other sweep direction. Bondanelli et al., 2021 used a state-space model to gain new insights into OFF responses in the auditory system. They showed that a network model including recurrent dynamics outperformed standard single-neuron models in explaining OFF responses. Not only did the model exhibit transient OFF responses similar to those observed in experiments, but it also made predictions about the low-dimensional nature of such OFF responses that were verified experimentally. These examples illustrate how larger-scale models can be used in conjunction with large data sets generated from electrophysiological and imaging experiments to gain a deeper understanding of the network dynamics that surround and shape single-neuron responses.
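The paradoxical ISN effect is easy to demonstrate in a two-population firing-rate model. In the sketch below (all weights, time constants, and inputs are illustrative choices, not fits to cortical data), recurrent excitation alone would be unstable (w_ee > 1), and increasing the external drive to the inhibitory population lowers its steady-state rate:

```python
import numpy as np

def ei_steady_state(I_e, I_i, w_ee=2.0, w_ei=1.5, w_ie=2.0, w_ii=0.5,
                    T=2.0, dt=1e-3, tau_e=0.02, tau_i=0.01):
    """Two-population rate model integrated to steady state. With
    w_ee > 1 the excitatory subnetwork is unstable on its own, so the
    network operates as an ISN when inhibition stabilizes it."""
    r_e, r_i = 1.0, 1.0
    relu = lambda x: max(x, 0.0)
    for _ in range(int(T / dt)):
        r_e += dt / tau_e * (-r_e + relu(w_ee * r_e - w_ei * r_i + I_e))
        r_i += dt / tau_i * (-r_i + relu(w_ie * r_e - w_ii * r_i + I_i))
    return r_e, r_i

print(ei_steady_state(I_e=2.0, I_i=1.0))  # -> (1.0, 2.0)
print(ei_steady_state(I_e=2.0, I_i=1.5))  # -> (0.5, ~1.67): more drive to I,
                                          #    yet a *lower* inhibitory rate
```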

3. Descriptive or encoding models

3.1. Single neuron models

While the mechanistic examples in section 2 attempt to explain the responses of cortical neurons in terms of the interaction between excitatory and inhibitory inputs and intrinsic properties, validating mechanistic models typically requires access to membrane potential recordings that are technically difficult to acquire. Extracellular recordings of spiking activity are by far the most common type of auditory cortical response data available at the single neuron level. This has spurred the development of several powerful approaches to characterize how stimuli are encoded by cortical neurons, i.e., to characterize the spectrotemporal receptive fields (STRFs) of these neurons. Early studies employed reverse correlation approaches using white noise acoustic stimuli to determine the average stimulus waveform that evoked a spike in auditory nerve fibers (de Boer, 1969, 1975, 1991). These studies laid the foundation for STRF models that characterized the input-output relationships of peripheral auditory neurons (Aertsen and Johannesma, 1980, 1981a, 1981b). STRF models have since undergone several adaptations to explain the activity of neurons across the auditory hierarchy, including the cochlear nucleus (Bandyopadhyay et al., 2007; Tan et al., 2015), auditory thalamus (Miller et al., 2002), and the auditory cortex (Christianson et al., 2008; Klein et al., 2000; Kowalski et al., 1996a, 1996b; Miller et al., 2002; Schnupp et al., 2001; Sharpee, 2013). However, the prediction accuracy of early linear STRF models, i.e., their ability to predict neural responses to new stimuli, was not high (Sahani and Linden, 2003), likely because auditory cortical responses show a range of strong nonlinearities. A previous review has extensively described the development of STRF models of auditory cortical responses (Sharpee, 2013). In this section of the review, we will focus on more recent descriptive models that, in addition to modeling linear receptive field components, also include some nonlinearities that are apparent in cortical data.
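As a reminder of what reverse correlation computes, the sketch below estimates an STRF as a spike-triggered average of the stimulus spectrogram (hypothetical names; this is an unbiased estimate only for spectrally white stimuli, which foreshadows the stimulus-correlation problems discussed below):

```python
import numpy as np

def spike_triggered_average(stim, spikes, n_lags=40):
    """Reverse correlation: average the stimulus-spectrogram history
    preceding each spike. `stim` is a (freq x time) spectrogram and
    `spikes` a list of spike time bins. Correlated stimuli require
    correction by the stimulus autocovariance (or a regularized/GLM
    fit; see section 3.1)."""
    sta = np.zeros((stim.shape[0], n_lags))
    count = 0
    for t in spikes:
        if t >= n_lags:
            sta += stim[:, t - n_lags:t]   # accumulate pre-spike windows
            count += 1
    return sta / max(count, 1)
```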

The spiking threshold, as well as the power-law relationship between membrane potential and firing rate, are strong nonlinearities that shape cortical responses. Linear-nonlinear (LN) models for STRF estimation have been developed to account for these effects. Like linear models, LN models also convolve a linear filter with the stimulus spectrogram, but then impose a static nonlinearity on the filter output to obtain time-varying firing rate estimates (Atencio et al., 2008). The static nonlinearity often takes the form of a rectified-linear or power-law function. Low degrees of ‘match’ between the STRF and the stimulus evoke no spiking output, whereas a high degree of match evokes correspondingly (linear) or disproportionately (power law) high outputs. A further extension to the LN model framework is the LNP (Linear-Nonlinear-Poisson) model, which uses the time-varying firing rate output of the LN model as the instantaneous firing rate of a non-homogeneous Poisson process to generate spike outputs (Moskovitz et al., 2018; Paninski et al., 2007; Schwartz et al., 2006; Williamson et al., 2015). In addition to achieving higher prediction accuracy compared to purely linear models, the LN model is powerful in that, for the same inputs and linear STRFs, minor adjustments to the slope or exponent of the static nonlinearity can result in sizeable changes to the output firing rate. This property has proved critical for modeling contrast gain control. For example, several studies have shown that varying the nonlinearity in a contrast-dependent manner can better capture the responses of cortical neurons in noisy listening conditions (Cooke et al., 2018; Rabinowitz et al., 2012, 2011; Willmore et al., 2014). Although LN models exhibit better predictive power than purely linear models, the performance of LN models in predicting responses to natural stimuli is still modest and fails to capture some important aspects of auditory coding (David et al., 2007; Machens et al., 2004).
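The following is a minimal sketch of the LN/LNP cascade described above; the rectified power-law exponent and all names are our illustrative choices:

```python
import numpy as np

def lnp_response(stim, strf, nonlin=lambda x: np.maximum(x, 0.0) ** 1.5,
                 dt=0.01, seed=0):
    """Minimal LNP model: (1) Linear stage: convolve the STRF with the
    spectrogram; (2) Nonlinear stage: a static rectifying power law
    (the 1.5 exponent is an illustrative choice); (3) Poisson stage:
    draw spike counts from the resulting time-varying rate."""
    n_f, n_lags = strf.shape
    n_t = stim.shape[1]
    drive = np.zeros(n_t)
    for t in range(n_lags, n_t):
        drive[t] = np.sum(strf * stim[:, t - n_lags:t])   # linear filtering
    rate = nonlin(drive)                                  # static nonlinearity
    spikes = np.random.default_rng(seed).poisson(rate * dt)  # Poisson spiking
    return rate, spikes
```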

Some of these issues may arise because of sub-optimal inputs to the STRF models. Typically, STRF models have used the stimulus spectrogram as a front-end to the model, but as discussed earlier, given the availability of detailed models of the auditory periphery, a more biophysically realistic front-end could be applied. Moreover, because the auditory cortex is located several synapses away from the auditory periphery, it is reasonable to expect that cortical inputs have been substantially transformed by the subcortical hierarchy. A recent study showed, however, that relatively simple front-ends to STRFs could perform as well as biophysically realistic front-ends (Rahman et al., 2020), suggesting alternate causes for the low predictive power of STRFs. For example, since it is known that inferior colliculus (IC) neurons also experience contrast gain control and adapt their firing rates to stimulus statistics (Dean et al., 2008), this IC adaptation could be used to transform the inputs to cortical linear filters. Such an ‘IC adaptation’ model indeed resulted in an improvement in prediction accuracy (Willmore et al., 2016). Other studies have applied various transformations to the input to the linear filter to take into account biological observations (Gill et al., 2006). For example, the relationship between sound intensity and neural response may be nonmonotonic (Brugge and Merzenich, 1973; Pfingst and O’Connor, 1981; Polley et al., 2007, 2004; Sadagopan and Wang, 2008). Modeling level-dependence as an input nonlinearity improved the performance of STRF models (Ahrens et al., 2008b). Similarly, transforming the inputs to represent stimulus surprise, i.e., whether a stimulus component was louder or quieter than expected given stimulus history and context, improved prediction accuracy in the avian forebrain (Gill et al., 2008). Other input transformations include accounting for the strongly depressing nature of the synaptic inputs to the cortex (Wehr and Zador, 2005). Including strongly depressing synapses at the inputs improved prediction accuracy by differentially affecting the STRF for stimulus classes with different temporal dynamics (David et al., 2009). Incorporating synaptic depression with multiple timescales similarly improved prediction accuracy (David and Shamma, 2013). More broadly, applying short-term plasticity dynamics to all or a subset of cortical inputs in this framework resulted in varying levels of benefit (Espejo et al., 2019). Recent results have shown that models incorporating short-term plasticity mechanisms at the inputs are complementary to the output nonlinearity-based gain control mechanisms described above (Pennington and David, 2020). More generally, the local context in which a sound occurs (of which the above mechanisms are a subset) can powerfully modulate the activity of cortical neurons (e.g., Sadagopan and Wang, 2009, 2010). Modeling the local context as a gain field applied to the inputs improved prediction accuracy (Ahrens et al., 2008a; Williamson et al., 2016). But even with the addition of these nonlinearities, STRF predictions often systematically deviate from observed firing rates, likely because the fundamental linear model does not consider nonlinear spectrotemporal interactions that are known to shape cortical responses (Machens et al., 2004; Sadagopan and Wang, 2009). Furthermore, the underlying assumption of linearity requires stimuli to exhibit Gaussian statistics along relevant dimensions.
But because natural stimuli are not Gaussian and consist of strongly correlated time-frequency components, STRFs estimated using reverse correlation from these stimuli are often strongly dependent on stimulus characteristics. This results in poor cross-predictivity, i.e., STRFs derived using one stimulus set are quite poor at predicting responses to another stimulus set (David et al., 2009; Laudanski et al., 2012; Theunissen et al., 2000; Woolley et al., 2005).

To account for non-Gaussian stimuli and some potential nonlinear interactions, alternative methods of STRF estimation have been developed. Sharpee et al., 2004 took an approach based on estimating the most important stimulus dimensions for generating neuronal responses in an information-theoretic sense (also see Sharpee, 2013). While the first most informative dimension typically corresponded to the spike-triggered average, i.e., the STRF estimated using a linear estimator, the second most informative dimension captured unique response characteristics such as non-monotonicity that were not captured by the first component (Atencio et al., 2008). Thus, the two receptive field components were synergistic, and provided higher prediction accuracy compared to only using the first, i.e., LN-like, component. Extending this approach (but in the visual cortex), Rowekamp and Sharpee, 2011 determined maximally informative subspaces that incorporated several receptive field components that synergistically provided high levels of information about the stimulus. This has also been observed in higher auditory cortical regions of the avian forebrain, where individual neurons respond to multiple distinct features of bird song and their receptive fields are composed of numerous components (Kozlov and Gentner, 2016). The observation of several synergistic receptive field components suggests that the inputs to cortical neurons may themselves be modeled as individual LN units. In this vein, Harper et al., 2016 developed a network receptive field (NRF) model in which the inputs to the cortical neuron to be modeled were themselves modeled as 20 LN units. This model outperformed traditional LN models but showed longer latencies and integration times than seen in auditory cortical neurons. Extending the NRF model further, Rahman et al., 2019 built in exponentially decaying memory for the input LN units of the model and modeled each unit’s firing using a dynamic equation. Capturing the dynamic integration of recent stimulus history by multiple input units in this manner indeed improved prediction accuracy further.

A second approach to ameliorate the effect of stimulus correlations in estimating receptive fields uses the generalized linear model (GLM) framework instead of reverse correlation to estimate the linear STRF and nonlinearity parameters of LN neurons. Calabrese et al. (2011) demonstrated that the GLM framework, with appropriate sparseness constraints, could generate STRFs with high prediction accuracy across stimulus classes. In their model, each neuron’s output spike train was generated from a Poisson process, whose instantaneous rate was determined by the convolution of a linear filter (STRF) with the stimulus spectrogram and an additional post-spike filter capturing the neuron’s recent spike history (Fig. 1B, left). Paralleling the results of Smith and Lewicki (2006) in the auditory nerve, the use of a sparseness constraint, i.e., assuming that the neuron was selective only to a small number of stimulus features, was critical in achieving high prediction accuracies. More generally, Stevenson et al., 2012 proposed a GLM encoding model which modeled a neuron’s firing rate in terms of 1) external variables or tuning properties of the neuron alone, 2) the activity of the other simultaneously recorded neurons, or 3) a combination of neuron tuning and the activity of other neurons. In several brain areas including A1, incorporating even a small number (10–30) of simultaneously observed neurons into the encoding model improved prediction accuracy considerably compared to purely tuning-based models. In addition to incorporating the impact of the activity of other neurons, the GLM framework also lends itself easily to incorporating other non-stimulus-related factors affecting neural responses, such as task variables, arousal, and reward (Pachitariu et al., 2015; Runyan et al., 2017; Saderi et al., 2021). As access to larger data sets becomes more feasible and as model fitting techniques improve, it is likely that even more complex model architectures will provide increasingly better quantitative descriptions of single-neuron spiking data (Pennington and David, 2022), essentially placing single-neuron spiking activity in the context of the spiking activity of a large circuit. The limited biological interpretability of these models, however, will present new challenges in the quest to gain insight into auditory cortical processing.
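A minimal Poisson GLM of the kind described above can be sketched as follows (the exponential inverse-link and the L-BFGS fit are our illustrative choices; the sparseness priors of Calabrese et al., 2011 are omitted):

```python
import numpy as np
from scipy.optimize import minimize

def fit_poisson_glm(X, y, n_hist=10):
    """Poisson GLM sketch: spike count y[t] ~ Poisson(exp(X[t]@k +
    H[t]@h + b)), where X holds lagged stimulus features (k plays the
    role of the STRF) and H holds the neuron's own recent spike counts
    (h is the post-spike history filter)."""
    T = len(y)
    H = np.stack([np.roll(y, lag) for lag in range(1, n_hist + 1)], axis=1)
    H[:n_hist] = 0                           # zero out wrapped-around history
    D = np.hstack([X, H, np.ones((T, 1))])   # design matrix, plus bias column

    def nll(w):                              # Poisson negative log-likelihood
        eta = D @ w
        return np.sum(np.exp(eta) - y * eta)

    res = minimize(nll, np.zeros(D.shape[1]), method='L-BFGS-B')
    k, h, b = res.x[:X.shape[1]], res.x[X.shape[1]:-1], res.x[-1]
    return k, h, b
```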

3.2. Multi-layer models

In section 3.1, we briefly mentioned models where the inputs to neurons were themselves modeled as LN units (e.g., Harper et al., 2016; Kozlov and Gentner, 2016; Rahman et al., 2019). A generalization of this approach is to model the inputs to cortical neurons as the output of a multi-layered network, approximating the biological inputs to cortical neurons that are filtered by many subcortical processing stages. Here, individual units in the input layers are often abstracted as LN models, where the linear component is an STRF that is convolved with the stimulus, and the nonlinear component is a static output nonlinearity. The large parameter space and computational demands of such models often preclude the addition of any further complexity or biophysical realism. As a recent example, Pennington and David, 2022 explored the use of several convolutional neural network (CNN) architectures that differed in terms of the number of convolutional layers and the number of dense or fully connected layers to fit the experimentally observed spiking activity of single neurons as well as neural populations. They demonstrated that for the same number of free parameters per neuron, CNN models handily outperformed traditional LN models in explaining cortical neuron responses. In addition, the models showed a high degree of generalizability in that the model, when trained on one population of neurons, performed well in fitting the activity of another population of neurons. Although a high degree of variance of cortical neuron firing rates can be explained by these models, the trade-off is human interpretability. Whereas the interpretation of an LN model is straightforward (convolution with a linear receptive field followed by thresholding), the patterns of activity in the CNN that lead to the final fit are complex and difficult to visualize. Although there is no reason to expect that cortical computations need to be human interpretable (as discussed later in this review), this is one drawback of using such multi-layer models to gain intuition about underlying neural computations from experimental observations. To address interpretability, Keshishian et al., 2020 used either a standard linear STRF or a CNN to fit cortical responses to speech recorded from the human auditory cortex (Fig. 1B, right). Responses to speech stimuli were taken to be the envelope of the high-gamma band of the electrocorticogram, recorded from human Heschl’s gyrus and superior temporal gyrus. As with the Pennington and David, 2022 study, this study also demonstrated that CNNs outperformed linear models in terms of prediction accuracy. But to better understand the nonlinear mapping between the stimulus and the response, the authors developed a technique, termed the dynamic STRF or DSTRF, which determines the linear function at each time point of the stimulus that is mathematically equivalent to the nonlinear transformation performed by the CNN. In this framework, neurons could be classified into those showing gain changes to the STRF, those showing delay changes to the STRF, or those showing STRF shape changes over time. The conceptualization of the CNN mapping as a dynamically varying STRF, and the description of responses in these categories, helped improve the interpretability of these multi-layer models.
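The DSTRF idea can be approximated for any differentiable encoding model by local linearization. The sketch below does this numerically with finite differences (the published method instead derives the equivalent linear function analytically from the trained CNN; the 40-bin context window and all names are our assumptions):

```python
import numpy as np

def dynamic_strf(model, stim, t, n_lags=40, eps=1e-3):
    """Locally linear approximation of a nonlinear encoding model, in
    the spirit of the DSTRF of Keshishian et al. (2020). `model` maps
    a (freq x lags) stimulus patch to a scalar response; the gradient
    of the response with respect to each spectrogram bin acts as a
    time-point-specific 'STRF' for the context at time t (t >= n_lags)."""
    patch = stim[:, t - n_lags:t].copy()
    base = model(patch)
    dstrf = np.zeros_like(patch)
    for idx in np.ndindex(patch.shape):       # finite-difference gradient
        patch[idx] += eps
        dstrf[idx] = (model(patch) - base) / eps
        patch[idx] -= eps
    return dstrf
```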

4. Normative models of auditory cortical computation

4.1. Single neuron models

At the single-neuron level, experimental observations have demonstrated that cortical neurons tend to maintain their dynamic activity range in a wide variety of listening environments (Rabinowitz et al., 2011; for a detailed review, see Willmore et al., 2014). Based on ideas developed in the visual cortex (Carandini and Heeger, 2012), it has been suggested that the reason for maintaining this dynamic activity range is to preserve neural sensitivity to acoustic features across listening environments (invariance). Contrast-dependent adjustments to the nonlinear stage of LN models, termed contrast gain control, have been proposed as a possible method of implementing contrast invariance (Rabinowitz et al., 2011; Willmore et al., 2014). High-contrast sounds (e.g., speech in quiet) show greater variations in sound pressure level (SPL) than do low-contrast sounds (e.g., speech in noise), consequently providing a narrower range of inputs to neurons in low-contrast situations. To preserve the sensitivity of neurons to this reduced range of inputs, one possible solution could be to adjust the slope of the nonlinearity in LN models of neurons in a contrast-dependent manner to maintain a consistent output dynamic range. In addition to rescaling of the output nonlinearity, it is also possible that other input transformations such as synaptic depression could contribute to the contrast invariance of auditory cortex responses (Mesgarani et al., 2014). While both inhibition and contrast-dependent fluctuations in membrane potentials have been proposed as mechanistic implementations of contrast gain control in the visual cortex, the precise mechanisms underlying contrast gain control in the auditory cortex have not yet been identified. For example, a recent study found that contrast gain control could occur independently of inhibitory neuron activity in the auditory cortex (Cooke et al., 2020). Our laboratory has recently proposed another normative model for why such rescaling mechanisms are important in the auditory cortex, from the perspective of accomplishing auditory categorization (Parida et al., 2022; see section 6 below).
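The core idea of contrast gain control can be written as a contrast-dependent rescaling of the slope of a sigmoidal output nonlinearity, sketched below with illustrative parameter values (not the fitted values of Rabinowitz et al., 2011):

```python
import numpy as np

def contrast_gain_output(x, stim_contrast, ref_contrast=1.0,
                         r_max=50.0, x50=0.0):
    """Output-nonlinearity rescaling, a sketch of contrast gain
    control: the slope of a sigmoid is scaled inversely with stimulus
    contrast, so a narrower input range (low contrast) still spans the
    full output range. All parameter values are illustrative."""
    gain = ref_contrast / stim_contrast       # steeper slope at low contrast
    return r_max / (1.0 + np.exp(-gain * (x - x50)))

x = np.linspace(-3, 3, 7)
print(contrast_gain_output(x, stim_contrast=1.0))  # high-contrast condition
print(contrast_gain_output(x, stim_contrast=0.3))  # low contrast: steeper curve
```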

4.2. Network models

Normative multi-layer models have been proposed to develop intuition about why complex receptive fields are observed in higher auditory cortical areas, and what computations such receptive fields might subserve. For example, Młynarski and McDermott, 2018 developed a two-layer generative hierarchical model, where the first layer consisted of spectrotemporal kernels (similar to STRFs) that were trained to sparsely represent natural sounds. Units in the second layer learned time-dependent combinations of first-layer spectrotemporal kernels, also under a sparseness constraint. After training with natural stimuli such as speech and environmental sounds, model units showed several similarities with experimental data. For example, first- and second-layer spectrotemporal kernels showed differences in their modulation transfer functions that were consistent with those measured from the thalamus and cortex (Miller et al., 2002). The feature selectivity of first- and second-layer units was also consistent with previous studies (Atencio et al., 2009). This increasing feature selectivity is reminiscent of that observed across auditory cortical laminae in other species (Montes-Lourido et al., 2021; Sadagopan and Wang, 2009). The resemblance between model units and experimental data thus suggested that the theoretical principle from which the model was constructed (learning dynamic second-order kernels from natural sounds) may also describe the computation being performed by cortical neurons. Critically, the model also generated predictions regarding the similarity of second-layer receptive fields (due to the pooling of first-layer neurons) and the opponent tuning of second-layer receptive fields that could be tested experimentally. In section 6, we discuss in detail work from our laboratory that proposes another normative model for the observation of complex receptive fields (Liu et al., 2019), one that uses the detection of maximally informative features to accomplish auditory categorization tasks.

5. Goal-directed network models

Sensory information in the world is hierarchically organized. In audition, for example, speech is hierarchical at a linguistic level (sentences, phrases, words) as well as at an acoustic level (information content at timescales from tens of milliseconds to seconds). To process information at these varying levels, it can be argued that the brain, as an information processing system, might adopt a hierarchical structure as well. The ultimate purpose of extracting information at these levels is to guide appropriate behavioral responses. For example, low-level representations of the words ‘Go’ and ‘Stop’ might show variations in pitch, duration, and formant peak locations, but these variations are inconsequential for performing the behaviors associated with these words. Goal-directed models emulate hierarchical architectures to transform early acoustic representations into higher-level representations that facilitate performance in tasks such as stimulus categorization. In this supervised approach, models are trained on a labeled dataset. The input layer is typically a dense representation of the stimulus or stimulus features, and the output layer typically reports the stimulus category. The number of layers between the input and output layers (network depth) varies greatly depending on the task, with more complex mappings requiring greater depth. For example, a recent DNN-based automatic speech recognition system that achieved near-human recognition accuracy for conversational speech used a 25-layer network to process stimulus spectrograms (Thomas et al., 2019). Neurons in these layers are modeled as abstract LN units. From an engineering perspective, many previous studies have used deep network architectures and large-scale data-driven approaches to tackle issues of auditory object recognition such as phoneme categorization (Lee et al., 2009), extracting vocals from music (Simpson et al., 2015), audio segmentation (Sainath et al., 2007), audio tagging (Xu et al., 2017), and speech separation (Hershey et al., 2016; Luo et al., 2018). However, only a few studies have used this framework to gain insight into auditory processing in the brain (Kell and McDermott, 2019; Richards et al., 2019). In this section, we will highlight a few such studies.

5.1. Insights into auditory processing from DNN-based models

Güçlü et al., 2016 developed one of the earliest DNN-based auditory models, which quantified the similarity between the representation of music in a DNN and in the human brain (using fMRI data). Using representational similarity analyses, the authors found that the anterior STG (superior temporal gyrus) was similar to shallower layers of the DNN representing low-level features, whereas the posterior STG represented higher-level features of music and was more similar to deeper DNN layers. Kell et al., 2018 trained various DNN architectures to perform speech (syllable recognition) and music (genre recognition) tasks. While distinct networks separately trained on each task performed the best, the study showed that networks that shared many early layers of processing (low-level features) and segregated into two pathways at deeper layers could also perform well in these tasks. The activation patterns in such networks strongly resembled human fMRI voxel-activation patterns when subjects were performing the same tasks. These results suggest that earlier layers (in both artificial and biological networks) may efficiently represent stimulus statistics whereas deeper layers form flexible and/or parallel pathways to perform specific tasks. Deeper layers of DNNs also show higher correlations with central and frontal components of EEG activity that originate from higher regions of the auditory processing hierarchy (Huang et al., 2018). This transformation is consistent with sequential nonlinear feature extraction by the auditory pathway, a phenomenon that was also observed in a hierarchical spiking neural network model (Khatami and Escabí, 2020). Here, the authors constructed a six-layer densely connected network of integrate-and-fire units, and optimized connection weights to maximize performance in a word-in-noise recognition task. The STRFs of model neurons in the trained network resembled those observed in-vivo, with temporal integration durations increasing with layer depth, mirroring the increasing integration windows along the auditory processing hierarchy. The model also captured other experimentally observed phenomena such as increasing sparsity along the hierarchy. The authors suggested that the model is indicative of a lossy transformation, where task-relevant cues are enhanced and noisy or irrelevant cues are discarded from successive neural representations along the hierarchy. However, one aspect of the data that this model did not capture was the increase in auditory receptive field complexity along the frequency axis at higher levels of the auditory processing hierarchy, including selectivity for harmonic stacks (Feng and Wang, 2017; Norman-Haignere et al., 2013; Patterson et al., 2002; Penagos et al., 2004; Tang et al., 2017) and other complex features (Montes-Lourido et al., 2021; Sadagopan and Wang, 2009).
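For reference, representational similarity analysis of the kind used in these studies reduces, at its core, to correlating stimulus-by-stimulus dissimilarity matrices computed separately for the model and the brain. A minimal sketch follows (the correlation-distance metric and Spearman comparison are common but not universal choices, and the names are hypothetical):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_similarity(resp_a, resp_b):
    """Representational similarity analysis, minimal sketch: build a
    representational dissimilarity matrix (RDM) over stimuli for each
    system (rows = stimuli, columns = units/voxels; e.g., DNN-layer
    activations vs. fMRI voxel patterns), then rank-correlate the two
    RDMs' condensed upper triangles."""
    rdm_a = pdist(resp_a, metric='correlation')   # condensed upper triangle
    rdm_b = pdist(resp_b, metric='correlation')
    rho, _ = spearmanr(rdm_a, rdm_b)
    return rho
```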

5.2. Limitations of DNN-based approaches

DNN models are powerful in that they can achieve near-human performance on complex auditory tasks and can explain a high fraction of the variance of some experimental observations (such as BOLD activity in the auditory cortex). A range of available tools and software packages makes these models easy to build and deploy, and they are particularly well-suited to the analysis of large datasets. However, several questions remain regarding the extent to which DNNs can be used to gain insight into biological mechanisms. For example, state-of-the-art DNNs boast tens of layers, substantially more than the number of processing stages in the brain. In comparison to biological systems, DNNs require vast amounts of labeled training data, and model performance often generalizes poorly to other (even related) tasks unless the model is retrained. Although DNNs can approach or surpass human performance on specific visual and auditory tasks, they exhibit very different patterns of errors compared to human subjects (Schrimpf et al., 2020; Weerts et al., 2022). Moreover, it is not straightforward to implement in these models the biological/neural mechanisms that are known to improve neural representations at different processing stages. For example, DNNs use a single abstract cell type to accomplish the task at hand, whereas the diversity of cell types in both subcortical and cortical stages of the auditory hierarchy is well-established (Arlotta and Paşca, 2019; Osen, 1969; Peruzzi et al., 2000; Winer, 1984; Winer and Morest, 1983). In the visual system, CNNs are powerful because they tile visual space with the same linear filter, modeling translation invariance; but it is unclear how this maps onto the auditory system, where translation along the cochlea is a critical determinant of auditory object identity. In addition, inputs to DNN models treat short segments of spectrograms as images and do not capture long-term temporal dependencies. While recent advances using recurrent architectures and attention units address some of these discrepancies between artificial and biological neural networks (Magnuson et al., 2020), it is likely that current DNN-based models of the auditory pathway approximate some of these complexities by using additional network layers, complicating their biological interpretation. Finally, from the perspective of Marr’s levels of understanding, the computational goal (categorization) and implementation (convolution) levels of DNNs are clear, but the algorithm used to achieve categorization cannot be easily stated. Thus, the development of biologically interpretable models is an important avenue of research from a neuroscientific perspective.

6. A biologically realistic and interpretable model of auditory categorization

Over the past few years, our laboratory has developed a hierarchical model that is trained to solve specific categorization tasks but that emphasizes biological realism and interpretability (Liu et al., 2019; Montes-Lourido et al., 2021; Kar et al., 2022; Parida et al., 2022). We adopted a top-down approach inspired by Marr’s levels of understanding and first identified a computational goal or task for the model to solve. To maximize the insight that could be gained about cortical mechanisms from the model, it was critical to choose both stimuli and tasks of sufficient complexity and behavioral relevance to engage cortical pathways. In vocal animals, one such task is the categorization of conspecific vocalizations. Vocal animals (in particular, mammals) such as marmosets (Agamaite et al., 2015) and guinea pigs (Berryman, 1976; Grimsley et al., 2012) use several categories of vocalizations to communicate internal states such as displeasure or fear, external events such as food availability, and social contexts such as affiliative or aggressive intent. Although not as complex as human speech, these vocalizations exhibit considerable spectrotemporal complexity, containing features such as frequency-modulated sweeps, harmonic stacks, and temporally repeating segments. Additionally, within each species, vocalization types often overlap in spectral content (likely constrained by the biomechanics of that species’ vocal tract), which means that animals cannot determine category based on spectral content alone. Critically, these vocalizations are produced with high trial-to-trial and inter-subject variability. To produce appropriate behavioral responses, the auditory system must therefore generalize over within-category variability to ascertain the category of a given vocalization. Thus, vocalizations are complex, behaviorally relevant stimuli, and categorizing them is a demanding task, meeting the conditions necessary for robust cortical engagement. Experimental evidence also supports this choice: whereas simple stimuli such as pure tones do not drive robust activity in higher cortical areas (Bendor and Wang, 2008; Rauschecker and Tian, 2000), vocalizations are known to evoke robust responses in secondary and higher cortical areas (Rauschecker et al., 1995; Tian et al., 2001) as well as in prefrontal regions (Cohen et al., 2007; Gifford et al., 2005; Romanski and Goldman-Rakic, 2002). Specialized regions that preferentially process vocalizations may exist in higher auditory cortical regions of humans (Belin, 2006; Bodin et al., 2021), macaques (Perrodin et al., 2011; Petkov et al., 2008), and marmosets (Sadagopan et al., 2015). Thus, it is reasonable to posit that vocalization categorization is one computational goal of auditory cortical processing.

Having identified a computational goal, we next asked how this goal could best be achieved; i.e., we constructed a normative model to determine the algorithms the auditory system might use to solve vocalization categorization. In this endeavor, we looked to parallels in the visual system. Like vocalizations, faces are complex objects, are behaviorally relevant, and exhibit tremendous within-category variability. Cortical regions at the highest stages of the visual processing hierarchy appear to respond preferentially to faces over other objects (Freiwald and Tsao, 2010; Tsao et al., 2003; Tsao and Livingstone, 2008). Thus, despite the different sensory modalities, and while recognizing that faces and vocalizations are not comparable in stimulus space (face information is available instantaneously in two spatial dimensions, whereas vocalization information unfolds over time along a single frequency dimension), the apparent similarities in how faces and vocalizations are represented in the visual and auditory hierarchies led us to examine algorithms for face categorization. Of these, a categorization algorithm based on the detection of maximally informative features of intermediate complexity (Ullman et al., 2016, 2002) stood out as a particularly suitable candidate for adaptation to vocalization categorization.

In this approach, the objective was to identify a set of intermediate-sized features that, as a set, best distinguishes the category of interest (a particular vocalization type) from all other categories. For our model (Liu et al., 2019), the first step was to use a suitable front end, such as a biophysically realistic auditory nerve model (Zilany et al., 2014), to generate biologically realistic input representations (cochleagrams) of a training set of within-category and outside-category stimuli. Next, a large number of random candidate features were generated from the within-category stimuli. Each feature corresponded to the spatiotemporal activity of a random subset of auditory nerve fibers within a random time window (a rectangular region of the cochleagram). Each feature may be conceptualized as the linear receptive field of a neuron that would be best driven by that feature. Then, we used a template-matching algorithm (normalized cross-correlation) to determine how well each candidate feature “matched” within-category and outside-category vocalizations in the training stimulus set. We convolved each feature with the stimulus (restricted to the bandwidth of the feature), resulting in a one-dimensional vector of normalized cross-correlation values, which may be conceptualized as the membrane potential fluctuations of the feature-sensitive neuron. The distributions of the maximum normalized cross-correlation values for the within-category and outside-category stimuli determine the usefulness of the feature for categorization (Fig. 2C). For example, a good feature would result in high maximum cross-correlation values for within-category stimuli and low cross-correlation values for outside-category stimuli, with little overlap between the two distributions. For each feature, we determined an optimal threshold for feature detection at which the within- and outside-category distributions were best separated, with the feature deemed present in the stimulus if the cross-correlation value exceeded this threshold. At this optimal threshold, we also determined the merit, or usefulness, of the feature for categorization using mutual information, and a weight defined as the feature’s log-likelihood ratio. The threshold may be conceptualized as the spiking threshold of a feature-selective neuron – when the membrane potential (cross-correlation) exceeds this threshold, the feature-selective neuron spikes, signifying the presence of its preferred feature. Thus, in this framework, each feature-selective neuron is an LN unit with a linear receptive field (the feature) and a static nonlinearity (the optimal threshold).
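The feature-detection computation described above can be summarized in a short sketch. Here, the cochleagram, the candidate feature, its frequency-band placement, and the detection threshold are all toy placeholders; in the actual model (Liu et al., 2019), the cochleagram comes from an auditory nerve model and the threshold is optimized to separate the within- and outside-category distributions.

```python
import numpy as np

def normalized_xcorr(cochleagram, template):
    """Slide a spectrotemporal feature template along the time axis of a
    cochleagram (restricted to the template's frequency band) and return
    the normalized cross-correlation at each time lag."""
    f0, n_f = 0, template.shape[0]   # feature's frequency band (toy: starts at channel 0)
    band = cochleagram[f0:f0 + n_f, :]
    n_t = template.shape[1]
    t_norm = (template - template.mean()) / (template.std() + 1e-12)
    vals = []
    for t in range(band.shape[1] - n_t + 1):
        patch = band[:, t:t + n_t]
        p_norm = (patch - patch.mean()) / (patch.std() + 1e-12)
        vals.append((t_norm * p_norm).mean())   # correlation value in [-1, 1]
    return np.asarray(vals)                     # ~membrane potential over time

rng = np.random.default_rng(2)
cochleagram = rng.random((60, 400))     # 60 AN channels x 400 time bins (toy stimulus)
feature = rng.random((12, 25))          # an intermediate-sized candidate feature
threshold = 0.3                         # placeholder for the optimized threshold

response = normalized_xcorr(cochleagram, feature)
detected = response.max() > threshold   # "spike": feature deemed present in stimulus
print("max correlation:", response.max(), "| feature detected:", detected)
```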

The critical computation in the model was how a final feature set was chosen from this initial, randomly generated set. Because of the random nature of candidate feature generation, many features in the initial set were uninformative, and the best features could be redundant (i.e., encoding the same stimulus features). Thus, as in the face categorization studies (Ullman et al., 2002), we used greedy search optimization to select, from this initial random set, a most informative feature (MIF) set that maximized categorization performance while minimizing self-redundancy. Briefly, we chose the best-performing feature as the first member of the MIF set. The second member was the feature that, together with the first, added the most information for categorization. We sequentially chose members that added the most information for categorization in this pairwise fashion until no more information could be added or a stopping criterion was reached (e.g., added information < 0.001 bits; see Liu et al., 2019 for details). At the end of this procedure, we were left with MIF sets consisting of a handful (10–20) of features for categorizing each vocalization type. For any given stimulus, we convolved all MIFs with the stimulus and determined which MIFs were detected, i.e., which of the maximum normalized cross-correlation values crossed the MIFs’ respective thresholds. The total evidence for the presence of a stimulus category was then taken to be the sum of the weights (log-likelihood ratios) of the detected features (Fig. 2C). We demonstrated that this set of MIFs alone was sufficient for the model to generalize over the training vocalization set, as well as to a new testing set of within- and outside-category vocalizations. The model achieved high categorization performance (>95% accuracy) on vocalization categorization tasks across several species (marmosets, guinea pigs, and macaques). Additionally, using the same algorithmic strategy, we achieved high performance on other auditory tasks (caller identification from marmoset twitter calls). Interestingly, similar to face categorization, features of intermediate size (in terms of duration and bandwidth) and complexity were the most informative for vocalization categorization. When features were too large (encompassing the entire vocalization), generalization to other within-category exemplars was impaired; when features were too small (integrating over only a few fibers for a few milliseconds), they occurred with similar frequency in within- and outside-category vocalizations.
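The greedy selection step can be sketched as follows. For brevity, this toy version scores a candidate set by the classification accuracy of its summed detections rather than by mutual information, and the binary detection matrix is randomly generated; both are stand-ins for the quantities computed in the published model.

```python
import numpy as np

def set_score(detections, labels):
    """Toy proxy for the information a feature set carries about the
    category label: accuracy of a simple evidence threshold. The published
    model uses mutual information directly (Liu et al., 2019)."""
    evidence = detections.sum(axis=0)            # total detections per stimulus
    pred = evidence >= np.median(evidence)
    return (pred == labels).mean()

rng = np.random.default_rng(3)
n_feat, n_stim = 200, 120
labels = rng.integers(0, 2, n_stim).astype(bool)   # within-category = True
# Toy binary detection matrix: feature i detected in stimulus j; informative
# features fire preferentially for within-category stimuli.
p_fire = np.where(labels, 0.6, 0.3)
detections = rng.random((n_feat, n_stim)) < p_fire

selected, remaining, best_perf = [], list(range(n_feat)), 0.0
while remaining:
    # Greedy step: add the feature whose inclusion most improves the set.
    gains = [set_score(detections[selected + [f]], labels) for f in remaining]
    f_best = remaining[int(np.argmax(gains))]
    if max(gains) - best_perf < 1e-3 or len(selected) >= 20:  # stopping criterion
        break
    best_perf = max(gains)
    selected.append(f_best)
    remaining.remove(f_best)

print(f"MIF set size: {len(selected)}, training performance: {best_perf:.2f}")
```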

Thus, in our model, the algorithmic solution to the categorization goal was to detect a set of most-informative features that performed best as a set and were least self-similar. Our model can be conceptualized as a three-layer network – the MIF units form a sparse feature-detector layer that integrates inputs from a dense spectrotemporal (cochleagram) layer (Figure 2A). The MIFs are biologically interpretable because they were derived to perform a specific and well-defined task – the set is chosen to maximize performance and minimize redundancy. Individual MIFs are LN units that integrate spatiotemporally over a small number of input units. A final output layer linearly weights detected features from the feature-detector layer and reports category. Thus, the model architecture is also interpretable and biologically realistic. MIFs can be learned from tens to hundreds of examples. However, the task specificity of the MIFs – a distinct set of MIFs needs to be learned to categorize each vocalization type or caller identity – also makes our algorithm quite inefficient. The conceptualization that each task requires a distinct set of features is supported by some experiments. For example, face-selective (Freiwald and Tsao, 2010; Tsao et al., 2003) and vocalization-selective (Perrodin et al., 2011; Grimsley et al., 2012) neurons exist in higher visual and auditory cortex, respectively. But the true value of the model lies in its ability to generate biologically grounded hypotheses and experimentally testable predictions, as detailed below.

First, the model predicted an increase in feature selectivity between two layers of processing in the auditory hierarchy, i.e., between a layer that densely represents all features of the stimulus (Fig. 3A, bottom) and a layer that sparsely represents specific vocalization features (Fig. 3A, top). By recording from three stages of the auditory hierarchy – the thalamus, the thalamorecipient layers (A1 L4), and the superficial layers (A1 L2/3) – we demonstrated such an increase between A1 L4 and A1 L2/3 in guinea pigs (Montes-Lourido et al., 2021). The receptive fields of A1 L4 neurons were selective for frequency (Fig. 3B, bottom), resulting in dense responses to many vocalization categories, whereas the receptive fields of A1 L2/3 neurons were selective for more complex features (Fig. 3B, top), resulting in selective responses to one or a few vocalization categories. Single spikes of A1 L2/3 neurons were more informative about vocalization category than the spikes of A1 L4 neurons. Thus, we could map the layers of the model architecture onto A1 laminae and provide support for the model at the electrophysiological level. Note that our claim is not that the precise features identified by our algorithm are the ones represented in the cortex; rather, we argue that the algorithmic principle of selecting the most contrastive and least redundant features is also implemented in cortex. Further experiments are needed to quantify the information content and redundancy of experimentally recorded feature-selective neurons and to determine how they compare with model feature-selective neurons. The neural mechanisms by which feature selectivity in A1 L2/3 is generated, how and where in auditory cortex feature-selective responses are combined to produce a categorical output, and whether the temporal ordering of detected features is important for category determination all remain critical open questions.

Figure 2: Schematic of feature-based hierarchical model for auditory categorization.


(A) Model architecture. In the first model layer (i.e., the cochleagram), the spectrotemporal content of the stimulus is represented in a dense manner. Individual model excitatory neurons (gray) in this layer show classic frequency-selective receptive fields (simplified schematics of receptive fields shown as insets). In the second layer, individual model excitatory neurons are selective for stimulus features that are maximally informative for performing specific tasks, such as the categorization of conspecific vocalizations. These model neurons show receptive fields of intermediate complexity (simplified schematics of feature-selective receptive fields shown as insets). A third layer (not shown) pools detected features to output a category decision. Results from electrophysiological recordings suggest that the responses and receptive fields of actual neurons in A1 L4 and A1 L2/3 are consistent with those of model neurons in the cochleagram and feature-selective layers, respectively. Putative inhibitory neurons in these layers (blue and green cells) modulate neuronal excitability. (B) Additive noise imposed by the listening environment alters the statistics of the auditory input (increase in mean and decrease in variance; histograms). In the input layer, putative inhibitory-neuron-based mechanisms help denoise the input representation by normalizing population activity (gain control). (C) Additive noise also decreases the ‘match’ between feature templates and the inputs, resulting in distributions of correlation values that are shifted to lower values (histograms). Cholinergic inputs recruited during increased listening effort may, through cortical disinhibitory circuit motifs, increase the excitability of feature-selective neurons and help separate within-category and outside-category stimuli.

Second, by replacing the categorical output stage with a trial-by-trial decision stage, the model could be used to predict behavioral responses to natural and modified vocalization stimuli in a Go/No-go task (Kar et al., 2022). We modeled the decision stage using a simple winner-take-all framework – for each category, all the features that spiked in response to a given stimulus were weighted by their log-likelihood ratios and summed to obtain the total evidence for the presence of that category, and the category with the highest evidence was taken to be the model’s output. We trained the model on categorization tasks using natural guinea pig vocalizations and tested the model with both novel natural vocalizations and vocalizations altered using various spectrotemporal manipulations. The categorization performance of the model was sensitive to some manipulations (addition of noise and frequency shifting; Fig. 3D, E) and insensitive to others (temporal stretching). We also trained guinea pigs on the same categorization tasks and tested them using the same manipulated stimuli. Guinea pigs showed similar sensitivity to the manipulations as the model (Fig. 3D, E), with the model explaining about 50% of the variance in guinea pig behavior (Kar et al., 2022). These experiments provided support for the model at the behavioral level. In addition, by examining which features the model relied on for task performance in these conditions, we could generate additional predictions regarding how features with distinct characteristics might be flexibly used in different contexts. Further experiments to test these predictions are underway. The decision stage may be conceptualized as a downstream area integrating feature-selective responses, or as a computation implemented via mutual inhibition between feature-selective neurons in A1 L2/3 (Chettih and Harvey, 2019). Further experiments are necessary to distinguish between these possibilities.
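The winner-take-all decision stage amounts to a weighted vote, as in the following sketch. The number of MIFs per category, their detection outcomes, and their log-likelihood weights are random placeholders here; in the model, detections come from thresholded cross-correlations and weights are learned during training.

```python
import numpy as np

rng = np.random.default_rng(4)
categories = ["wheek", "purr", "chut", "whine"]

# For one test stimulus: which of each category's MIFs were detected (max
# normalized cross-correlation exceeded that MIF's threshold), and each MIF's
# weight (its log-likelihood ratio). Values here are random placeholders.
detected = {c: rng.random(15) < 0.4 for c in categories}   # 15 MIFs per category
weights = {c: rng.exponential(1.0, 15) for c in categories}

# Total evidence per category = sum of the weights of its detected MIFs;
# winner-take-all: report the category with the highest evidence.
evidence = {c: weights[c][detected[c]].sum() for c in categories}
winner = max(evidence, key=evidence.get)
print(evidence, "->", winner)
```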

The MIFs capture within-category variations, i.e., random fluctuations about an archetypal feature that arise during production. However, other variations that are imposed by the listening environment, such as the addition of noise or reverberation, do not follow this pattern. Noise results in decreased overall cross-correlation values between the stimulus and the feature, as well as decreased separation between the within- and outside-category distributions (Fig. 2B, C). Thus, in such conditions, the MIF-based approach fails when no normalization is used for feature detection (Parida et al., 2022). This shortcoming could be ameliorated to an extent by training the MIFs in noisy and reverberant conditions, which is biologically realistic – it is reasonable to expect that during early development, animals acquire vocalization categories in a range of acoustic conditions. However, the benefits of training in one condition (say, noise) did not generalize to other conditions (say, reverberation). Therefore, we developed two modules based on experimental observations to improve model performance in adverse conditions. First, guided by contrast gain control studies (Rabinowitz et al., 2011; Willmore et al., 2014), we implemented normalization in the dense spectrotemporal layer (Fig. 2B). This normalization mimicked inhibitory-neuron-based contrast gain control (Atallah et al., 2012; Carandini and Heeger, 1994; Wilson et al., 2012; but see Cooke et al., 2020) to maintain the overall activity level in the spectrotemporal stage across listening conditions. After normalization, the shifts in the distributions of cross-correlation values in noisy conditions were smaller, but the benefits of normalization varied widely across vocalization categories. Second, inspired by recent studies demonstrating how arousal may modulate cortical activity by activating specific inhibitory neuronal populations (vasoactive intestinal peptide-expressing, or VIP, interneurons) (Pi et al., 2013), we implemented the effects of listening effort as a top-down modulation of feature-detector excitability. Here, increased listening effort activated VIP interneurons that in turn inhibited other cortical interneurons, with the net effect being increased cortical excitability (Fig. 2C). Thus, although the distributions of cross-correlation values were shifted to lower values, the thresholds of the feature-detecting neurons were also proportionally decreased to values that better separated the within- and outside-category distributions. This top-down approach yielded more uniform benefits across noisy conditions, and the model sometimes outperformed animals in categorization-in-noise tasks (Parida et al., 2022). Interestingly, our implementation of top-down listening effort provided limited benefit in reverberant conditions, consistent with observations from human psychoacoustics (McCloy et al., 2017; Picou et al., 2016; Prodi and Visentin, 2022). Overall, these results highlight the modular nature of our model and demonstrate how experimental observations can be incorporated into its basic hierarchical architecture.
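Both modules can be sketched compactly. The divisive normalization below is one simple way to hold population activity constant across listening conditions, and the effort module lowers detection thresholds in proportion to a scalar ‘listening effort’ signal; the specific functional forms and constants are illustrative placeholders, not the published implementation (Parida et al., 2022).

```python
import numpy as np

def gain_normalize(cochleagram, target_rms=1.0, eps=1e-12):
    """Divisive normalization of the dense spectrotemporal layer: each time
    bin is scaled by the population activity at that bin, mimicking
    inhibitory contrast gain control and keeping overall activity stable
    across clean and noisy conditions."""
    pop_activity = np.sqrt((cochleagram ** 2).mean(axis=0)) + eps
    return target_rms * cochleagram / pop_activity

def effortful_threshold(base_threshold, listening_effort):
    """Top-down 'listening effort' module: VIP-interneuron-mediated
    disinhibition modeled as a proportional lowering of feature-detector
    thresholds (increased excitability). listening_effort in [0, 1]."""
    max_reduction = 0.5        # placeholder ceiling on disinhibition
    return base_threshold * (1.0 - max_reduction * listening_effort)

rng = np.random.default_rng(5)
clean = rng.random((60, 400))
noisy = clean + 0.8 * rng.random((60, 400))   # additive noise raises the mean

normalized = gain_normalize(noisy)
print("mean activity clean/noisy/normalized:",
      clean.mean().round(2), noisy.mean().round(2), normalized.mean().round(2))
print("threshold at rest vs. high effort:",
      effortful_threshold(0.30, 0.0), effortful_threshold(0.30, 1.0))
```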

7. Conclusion

In this review, we have discussed a selection of quantitative models, ranging from those that use detailed biophysical data to explain experimental observations to those that use large-scale neural network approaches to solve specific tasks. These approaches are characterized by different motivations (modeling a behavioral process vs. exploring the necessity of a specific ion channel) and varied levels of abstraction (detailed biophysics vs. integrate-and-fire units). All of these models succeed at explaining specific experimental phenomena, but it is still not clear how these different models relate to one another. Some examples suggest that the same underlying mechanisms could indeed contribute to different phenomena – for example, synaptic depression has been used to explain features of stimulus-specific adaptation (Lee and Sherman, 2008; Vanattou-Saïfoudine et al., 2021; Yarden and Nelken, 2017), and also to explain reverberation- and noise-resistant representations of sounds (David and Shamma, 2013; Mesgarani et al., 2014). Disinhibitory circuit motifs developed to explain the modulation of cortical activity by internal state (Letzkus et al., 2011; Pi et al., 2013) may also prove useful for auditory categorization in noise (Parida et al., 2022). A deeper exploration of such inter-relationships could lead to novel insight into auditory processing and generate novel predictions. As our ability to acquire large experimental datasets increases, and as tools to incorporate increasing levels of biological detail into large-scale models become available (e.g., Olah et al., 2022), close collaboration between modelers generating testable predictions and experimentalists testing these predictions and generating new data and constraints for modeling will be indispensable. It is our hope that interest in and support for such collaborations will continue to increase in the coming years.

Figure 3: Model comparisons to electrophysiological and behavioral data.


(A) (bottom) STRFs of two model units in the model’s dense cochleagram layer (Layer 1) and PSTHs (blue lines) of their responses to a sequence of guinea pig vocalizations. Gray dashed lines separate two example vocalizations from each labeled category. These model neurons exhibit compact, frequency-selective STRFs and respond to numerous vocalization types. (top) STRFs of a wheek MIF and a whine MIF from the feature-detector layer (Layer 2) of the model. These model neurons exhibit complex STRFs and respond to only a few call types. (B) Electrophysiological data recorded from neurons in A1 of awake, passively listening guinea pigs. (bottom) Like model Layer 1 neurons, A1 L4 neurons exhibited compact, frequency-selective STRFs and responded to numerous call types. (top) Like model Layer 2 neurons, many A1 L2/3 neurons exhibited complex STRFs and responded to only a few vocalization types. (C) We trained five models with non-overlapping MIF sets (black circles and lines) and a winner-take-all decision stage, and two guinea pigs (green and purple discs), on a categorization task in a Go/No-go framework, where the objective was to distinguish wheek vocalizations from other (purr, chut, and whine) vocalizations. We measured performance using the sensitivity index d′. Guinea pigs learned the task in about a month of training (note that these two guinea pigs were previously trained on a wheek vs. whine task). Both guinea pigs and the model generalized equally well to novel stimuli. (D) The performance of both guinea pigs and the model degraded with the addition of white noise. Although the model outperformed the guinea pigs in this task, it explained about 65% of the variance in guinea pig behavior as quantified using R-squared (R²). (E) The performance of both guinea pigs and the model was similarly affected by pitch shifts of the vocalization stimuli. GP, guinea pig; MAE, median absolute error.

Funding

We acknowledge support from the National Institutes of Health (NIDCD R01DC017141 and NIDCD R01DC013315), the Pennsylvania Lions Hearing Research Foundation, the Brain and Behavior Research Foundation, the Samuel and Emma Winters Foundation, and the University of Pittsburgh.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Declaration of Competing Interest

The authors have no competing interests to declare.

References

1. Aasland WA, Baum SR, 2003. Temporal parameters as cues to phrasal boundaries: A comparison of processing by left- and right-hemisphere brain-damaged individuals. Brain Lang 87. 10.1016/S0093-934X(03)00138-X
2. Abbott LF, Varela JA, Sen K, Nelson SB, 1997. Synaptic depression and cortical gain control. Science 275, 220–224. 10.1126/science.275.5297.221
3. Aertsen AMHJ, Johannesma PIM, 1981a. A comparison of the spectro-temporal sensitivity of auditory neurons to tonal and natural stimuli. Biol Cybern 42. 10.1007/BF00336732
4. Aertsen AMHJ, Johannesma PIM, 1981b. The spectro-temporal receptive field. Biol Cybern 42. 10.1007/BF00336731
5. Aertsen AMHJ, Johannesma PIM, 1980. Spectro-temporal receptive fields of auditory neurons in the grassfrog - I. Characterization of tonal and natural stimuli. Biol Cybern 38. 10.1007/BF00337015
6. Agamaite JA, Chang C-J, Osmanski MS, Wang X, 2015. A quantitative acoustic analysis of the vocal repertoire of the common marmoset (Callithrix jacchus). J Acoust Soc Am 138, 2906–2928.
7. Ahrens MB, Linden JF, Sahani M, 2008a. Nonlinearities and contextual influences in auditory cortical responses modeled with multilinear spectrotemporal methods. Journal of Neuroscience 28. 10.1523/JNEUROSCI.3377-07.2008
8. Ahrens MB, Paninski L, Sahani M, 2008b. Inferring input nonlinearities in neural encoding models. Network: Computation in Neural Systems 19. 10.1080/09548980701813936
9. Aizenberg M, Mwilambwe-Tshilobo L, Briguglio JJ, Natan RG, Geffen MN, 2015. Bidirectional regulation of innate and learned behaviors that rely on frequency discrimination by cortical inhibitory neurons. PLoS Biol 13. 10.1371/journal.pbio.1002308
10. Andoni S, Li N, Pollak GD, 2007. Spectrotemporal receptive fields in the inferior colliculus revealing selectivity for spectral motion in conspecific vocalizations. Journal of Neuroscience 27. 10.1523/JNEUROSCI.4342-06.2007
11. Aponte DA, Handy G, Kline AM, Tsukano H, Doiron B, Kato HK, 2021. Recurrent network dynamics shape direction selectivity in primary auditory cortex. Nat Commun 12. 10.1038/s41467-020-20590-6
12. Arlotta P, Paşca SP, 2019. Cell diversity in the human cerebral cortex: from the embryo to brain organoids. Curr Opin Neurobiol. 10.1016/j.conb.2019.03.001
13. Atallah BV, Bruns W, Carandini M, Scanziani M, 2012. Parvalbumin-expressing interneurons linearly transform cortical responses to visual stimuli. Neuron 73, 159–170. 10.1016/j.neuron.2011.12.013
14. Atencio CA, Sharpee TO, Schreiner CE, 2009. Hierarchical computation in the canonical auditory cortical circuit. Proc Natl Acad Sci U S A 106, 21894–21899. 10.1073/pnas.0908383106
15. Atencio CA, Sharpee TO, Schreiner CE, 2008. Cooperative nonlinearities in auditory cortical neurons. Neuron 58. 10.1016/j.neuron.2008.04.026
16. Bandyopadhyay S, Reiss LAJ, Young ED, 2007. Receptive field for dorsal cochlear nucleus neurons at multiple sound levels. J Neurophysiol 98. 10.1152/jn.00539.2007
17. Belin P, 2006. Voice processing in human and non-human primates. Philosophical Transactions of the Royal Society B: Biological Sciences. 10.1098/rstb.2006.1933
18. Bendor D, 2015. The role of inhibition in a computational model of an auditory cortical neuron during the encoding of temporal information. PLoS Comput Biol 11. 10.1371/journal.pcbi.1004197
19. Bendor D, Wang X, 2008. Neural response properties of primary, rostral, and rostrotemporal core fields in the auditory cortex of marmoset monkeys. J Neurophysiol 100. 10.1152/jn.00884.2007
20. Bendor D, Wang X, 2007. Differential neural coding of acoustic flutter within primate auditory cortex. Nat Neurosci 10. 10.1038/nn1888
21. Berryman JC, 1976. Guinea-pig vocalizations: Their structure, causation and function. Z Tierpsychol 41, 80–106.
22. Bodin C, Trapeau R, Nazarian B, Sein J, Degiovanni X, Baurberg J, Rapha E, Renaud L, Giordano BL, Belin P, 2021. Functionally homologous representation of vocalizations in the auditory cortex of humans and macaques. Current Biology 31. 10.1016/j.cub.2021.08.043
23. Bondanelli G, Deneux T, Bathellier B, Ostojic S, 2021. Network dynamics underlying OFF responses in the auditory cortex. Elife 10. 10.7554/eLife.53151
24. Bruce IC, Erfani Y, Zilany MSA, 2018. A phenomenological model of the synapse between the inner hair cell and auditory nerve: Implications of limited neurotransmitter release sites. Hear Res 360, 40–54. 10.1016/j.heares.2017.12.016
25. Brugge JF, Merzenich MM, 1973. Responses of neurons in auditory cortex of the macaque monkey to monaural and binaural stimulation. J Neurophysiol 36. 10.1152/jn.1973.36.6.1138
26. Buonomano DV, 2000. Decoding temporal information: A model based on short-term synaptic plasticity. Journal of Neuroscience 20. 10.1523/jneurosci.20-03-01129.2000
27. Buonomano DV, Merzenich MM, 1995. Temporal information transformed into a spatial code by a neural network with realistic properties. Science 267. 10.1126/science.7863330
28. Caird D, Klinke R, 1983. Processing of binaural stimuli by cat superior olivary complex neurons. Exp Brain Res 52, 385–399. 10.1007/BF00238032
29. Calabrese A, Schumacher JW, Schneider DM, Paninski L, Woolley SMN, 2011. A generalized linear model for estimating spectrotemporal receptive fields from responses to natural sounds. PLoS One 6, e16104. 10.1371/journal.pone.0016104
30. Carandini M, Heeger DJ, 2012. Normalization as a canonical neural computation. Nat Rev Neurosci. 10.1038/nrn3136
31. Carandini M, Heeger DJ, 1994. Summation and division by neurons in primate visual cortex. Science 264, 1333–1336. 10.1126/science.8191289
32. Carney LH, Li T, McDonough JM, 2015. Speech coding in the brain: Representation of vowel formants by midbrain neurons tuned to sound fluctuations. eNeuro 2, 4–15. 10.1523/ENEURO.0004-15.2015
33. Carney LH, Yin TCT, 1988. Temporal coding of resonances by low-frequency auditory nerve fibers: Single-fiber responses and a population model. J Neurophysiol 60, 1653–1677. 10.1152/jn.1988.60.5.1653
34. Chettih SN, Harvey CD, 2019. Single-neuron perturbations reveal feature-specific competition in V1. Nature 567, 334–340. 10.1038/s41586-019-0997-6
35. Christianson GB, Sahani M, Linden JF, 2008. The consequences of response nonlinearities for interpretation of spectrotemporal receptive fields. Journal of Neuroscience. 10.1523/JNEUROSCI.1775-07.2007
36. Cohen YE, Theunissen F, Russ BE, Gill P, 2007. Acoustic features of rhesus vocalizations and their representation in the ventrolateral prefrontal cortex. J Neurophysiol 97. 10.1152/jn.00769.2006
37. Colburn HS, Carney LH, Heinz MG, 2003. Quantifying the information in auditory-nerve responses for level discrimination. JARO - Journal of the Association for Research in Otolaryngology 4, 294–311. 10.1007/s10162-002-1090-6
38. Cooke JE, King AJ, Willmore BDB, Schnupp JWH, 2018. Contrast gain control in mouse auditory cortex. J Neurophysiol 120. 10.1152/jn.00847.2017
39. Cooke JE, Kahn MC, Mann EO, King AJ, Schnupp JWH, Willmore BDB, 2020. Contrast gain control occurs independently of both parvalbumin-positive interneuron activity and shunting inhibition in auditory cortex. J Neurophysiol 123, 1536–1551. 10.1152/jn.00587.2019
40. Dau T, Kollmeier B, Kohlrausch A, 1998a. Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration. J Acoust Soc Am 102, 2906. 10.1121/1.420345
41. Dau T, Kollmeier B, Kohlrausch A, 1998b. Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. J Acoust Soc Am 102, 2892. 10.1121/1.420344
42. David SV, Mesgarani N, Fritz JB, Shamma SA, 2009. Rapid synaptic depression explains nonlinear modulation of spectro-temporal tuning in primary auditory cortex by natural stimuli. Journal of Neuroscience 29. 10.1523/JNEUROSCI.5249-08.2009
43. David SV, Mesgarani N, Shamma SA, 2007. Estimating sparse spectro-temporal receptive fields with natural stimuli. Network: Computation in Neural Systems. 10.1080/09548980701609235
44. David SV, Shamma SA, 2013. Integration over multiple timescales in primary auditory cortex. Journal of Neuroscience 33. 10.1523/JNEUROSCI.2270-13.2013
45. Dean I, Robinson BL, Harper NS, McAlpine D, 2008. Rapid neural adaptation to sound level statistics. Journal of Neuroscience 28. 10.1523/JNEUROSCI.0470-08.2008
46. de Boer E, 1969. Encoding of frequency information in the discharge pattern of auditory nerve fibers. International Audiology 8, 547–556.
47. de Boer E, 1975. Synthetic whole-nerve action potentials for the cat. J Acoust Soc Am 58. 10.1121/1.380762
48. de Boer E, 1991. Auditory physics. Physical principles in hearing theory. III. Phys Rep. 10.1016/0370-1573(91)90068-W
49. de Boer E, de Jongh HR, 1998. On cochlear encoding: Potentialities and limitations of the reverse-correlation technique. J Acoust Soc Am 63, 115. 10.1121/1.381704
50. DeWeese MR, Zador AM, 2006. Non-gaussian membrane potential dynamics imply sparse, synchronous activity in auditory cortex. Journal of Neuroscience 26. 10.1523/JNEUROSCI.2813-06.2006
51. Dicke U, Ewert SD, Dau T, Kollmeier B, 2007. A neural circuit transforming temporal periodicity information into a rate-based representation in the mammalian auditory system. J Acoust Soc Am 121, 310. 10.1121/1.2400670
52. Eggermont JJ, Johannesma PIM, Aertsen AMHJ, 1983. Reverse-correlation methods in auditory research. Q Rev Biophys 16, 341–414. 10.1017/S0033583500005126
53. Escabí MA, Schreiner CE, 2002. Nonlinear spectrotemporal sound analysis by neurons in the auditory midbrain. Journal of Neuroscience 22, 4114–4131. 10.1523/JNEUROSCI.22-10-04114.2002
54. Espejo ML, Schwartz ZP, David SV, 2019. Spectral tuning of adaptation supports coding of sensory context in auditory cortex. PLoS Comput Biol 15. 10.1371/journal.pcbi.1007430
55. Feng L, Wang X, 2017. Harmonic template neurons in primate auditory cortex underlying complex sound processing. Proc Natl Acad Sci U S A 114. 10.1073/pnas.1607519114
56. Freiwald WA, Tsao DY, 2010. Functional compartmentalization and viewpoint generalization within the macaque face-processing system. Science 330. 10.1126/science.1194908
57. Fritz J, Shamma S, Elhilali M, Klein D, 2003. Rapid task-related plasticity of spectrotemporal receptive fields in primary auditory cortex. Nat Neurosci 6, 1216–1223. 10.1038/nn1141
58. Gao X, Wehr M, 2015. A coding transformation for temporally structured sounds within auditory cortical neurons. Neuron 86. 10.1016/j.neuron.2015.03.004
59. Gifford GW, MacLean KA, Hauser MD, Cohen YE, 2005. The neurophysiology of functionally meaningful categories: Macaque ventrolateral prefrontal cortex plays a critical role in spontaneous categorization of species-specific vocalizations. J Cogn Neurosci 17. 10.1162/0898929054985464
60. Gill P, Woolley SMN, Fremouw T, Theunissen FE, 2008. What’s that sound? Auditory area CLM encodes stimulus surprise, not intensity or intensity changes. J Neurophysiol 99. 10.1152/jn.01270.2007
61. Gill P, Zhang J, Woolley SMN, Fremouw T, Theunissen FE, 2006. Sound representation methods for spectro-temporal receptive field estimation. J Comput Neurosci 21. 10.1007/s10827-006-7059-4
62. Grimsley JMS, Shanbhag SJ, Palmer AR, Wallace MN, 2012. Processing of communication calls in guinea pig auditory cortex. PLoS One 7. 10.1371/journal.pone.0051646
63. Grothe B, 2003. New roles for synaptic inhibition in sound localization. Nat Rev Neurosci 4, 540–550. 10.1038/nrn1136
64. Güçlü U, Thielen J, Hanke M, van Gerven M, 2016. Brains on beats. Adv Neural Inf Process Syst 29.
65. Harper NS, Schoppe O, Willmore BDB, Cui Z, Schnupp JWH, King AJ, 2016. Network receptive field modeling reveals extensive integration and multi-feature selectivity in auditory cortical neurons. PLoS Comput Biol 12. 10.1371/journal.pcbi.1005113
66. Harris KD, Bartho P, Chadderton P, Curto C, de la Rocha J, Hollender L, Itskov V, Luczak A, Marguet SL, Renart A, Sakata S, 2011. How do neurons work together? Lessons from auditory cortex. Hear Res 271. 10.1016/j.heares.2010.06.006
67. Heinz MG, Colburn HS, Carney LH, 2001. Evaluating auditory performance limits: I. One-parameter discrimination using a computational model for the auditory nerve. Neural Comput 13, 2273–2316. 10.1162/089976601750541804
68. Hershey JR, Chen Z, le Roux J, Watanabe S, 2016. Deep clustering: Discriminative embeddings for segmentation and separation. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. 10.1109/ICASSP.2016.7471631
69. Hromádka T, Zador AM, DeWeese MR, 2013. Up states are rare in awake auditory cortex. J Neurophysiol 109. 10.1152/jn.00600.2012
70. Huang N, Slaney M, Elhilali M, 2018. Connecting deep neural networks to physical, perceptual, and electrophysiological auditory signals. Front Neurosci 12, 532. 10.3389/fnins.2018.00532
71. Janata P, Grafton ST, 2003. Swinging in the brain: Shared neural substrates for behaviors related to sequencing and music. Nat Neurosci. 10.1038/nn1081
72. Jeffress LA, 1948. A place theory of sound localization. J Comp Physiol Psychol 41, 35–39. 10.1037/h0061495
73. Jørgensen S, Dau T, 2011. Predicting speech intelligibility based on the signal-to-noise envelope power ratio after modulation-frequency selective processing. J Acoust Soc Am 130, 1475. 10.1121/1.3621502
74. Joris PX, Yin TCT, 1998. Responses to amplitude-modulated tones in the auditory nerve of the cat. J Acoust Soc Am 91, 215. 10.1121/1.402757
75. Kandler K, Gillespie DC, 2005. Developmental refinement of inhibitory sound-localization circuits. Trends Neurosci 28, 290–296. 10.1016/j.tins.2005.04.007
76. Kar M, Pernia M, Williams K, Parida S, Schneider NA, McAndrew M, Kumbam I, Sadagopan S, 2022. Vocalization categorization behavior explained by a feature-based auditory categorization model. Elife 11, e78278. 10.7554/eLife.78278
77. Kato HK, Asinof SK, Isaacson JS, 2017. Network-level control of frequency tuning in auditory cortex. Neuron 95. 10.1016/j.neuron.2017.06.019
78. Kell AJE, Yamins DLK, Shook EN, Norman-Haignere SV, McDermott JH, 2018. A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron 98. 10.1016/j.neuron.2018.03.044
79. Kell AJ, McDermott JH, 2019. Deep neural network models of sensory systems: windows onto the role of task constraints. Curr Opin Neurobiol. 10.1016/j.conb.2019.02.003
80. Keshishian M, Akbari H, Khalighinejad B, Herrero JL, Mehta AD, Mesgarani N, 2020. Estimating and interpreting nonlinear receptive field of sensory neural responses with deep neural network models. Elife 9. 10.7554/eLife.53445
81. Khatami F, Escabí MA, 2020. Spiking network optimized for word recognition in noise predicts auditory system hierarchy. PLoS Comput Biol 16. 10.1371/journal.pcbi.1007558
82. Klein DJ, Depireux DA, Simon JZ, Shamma SA, 2000. Robust spectrotemporal reverse correlation for the auditory system: Optimizing stimulus design. J Comput Neurosci. 10.1023/A:1008990412183
83. Kowalski N, Depireux DA, Shamma SA, 1996a. Analysis of dynamic spectra in ferret primary auditory cortex. II. Prediction of unit responses to arbitrary dynamic spectra. J Neurophysiol 76. 10.1152/jn.1996.76.5.3524
84. Kowalski N, Depireux DA, Shamma SA, 1996b. Analysis of dynamic spectra in ferret primary auditory cortex. I. Characteristics of single-unit responses to moving ripple spectra. J Neurophysiol 76. 10.1152/jn.1996.76.5.3503
85. Kozlov AS, Gentner TQ, 2016. Central auditory neurons have composite receptive fields. Proc Natl Acad Sci U S A 113. 10.1073/pnas.1506903113
86. Krishna BS, Semple MN, 2000. Auditory temporal processing: Responses to sinusoidally amplitude-modulated tones in the inferior colliculus. J Neurophysiol 84, 255–273. 10.1152/jn.2000.84.1.255
87. Kuchibhotla KV, Gill JV, Lindsay GW, Papadoyannis ES, Field RE, Sten TAH, Miller KD, Froemke RC, 2017. Parallel processing by cortical inhibition enables context-dependent behavior. Nat Neurosci 20, 62–71. 10.1038/nn.4436
88. Langner G, Schreiner CE, 1988. Periodicity coding in the inferior colliculus of the cat. I. Neuronal mechanisms. J Neurophysiol 60, 1799–1822. 10.1152/jn.1988.60.6.1799
89. Latham PE, Nirenberg S, 2004. Computing and stability in cortical networks. Neural Comput 16. 10.1162/089976604323057434
90. Laudanski J, Edeline JM, Huetz C, 2012. Differences between spectro-temporal receptive fields derived from artificial and natural stimuli in the auditory cortex. PLoS One 7. 10.1371/journal.pone.0050539
91. Lee CC, Sherman SM, 2008. Synaptic properties of thalamic and intracortical inputs to layer 4 of the first- and higher-order cortical areas in the auditory and somatosensory systems. J Neurophysiol 100, 317–326. 10.1152/jn.90391.2008
92. Lee H, Yan L, Pham P, Ng AY, 2009. Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Advances in Neural Information Processing Systems 22 - Proceedings of the 2009 Conference.
93. Lee JH, Wang X, Bendor D, 2020. The role of adaptation in generating monotonic rate codes in auditory cortex. PLoS Comput Biol 16. 10.1371/journal.pcbi.1007627
94. Lemus L, Hernández A, Romo R, 2009. Neural codes for perceptual discrimination of acoustic flutter in the primate auditory cortex. Proc Natl Acad Sci U S A 106. 10.1073/pnas.0904066106
95. Letzkus JJ, Wolff SBE, Meyer EMM, Tovote P, Courtin J, Herry C, Lüthi A, 2011. A disinhibitory microcircuit for associative fear learning in the auditory cortex. Nature 480, 331–335. 10.1038/nature10674
96. Liang L, Lu T, Wang X, 2002. Neural representations of sinusoidal amplitude and frequency modulations in the primary auditory cortex of awake primates. J Neurophysiol 87. 10.1152/jn.2002.87.5.2237
97. Liberman MC, 1978. Auditory-nerve response from cats raised in a low-noise chamber. J Acoust Soc Am 63, 442. 10.1121/1.381736
98. Liu ST, Montes-Lourido P, Wang X, Sadagopan S, 2019. Optimal features for auditory categorization. Nat Commun 10. 10.1038/s41467-019-09115-y
99. Liu X-P, Wang X, 2022. Distinct neuronal types contribute to hybrid temporal encoding strategies in primate auditory cortex. PLoS Biol 20, e3001642. 10.1371/journal.pbio.3001642
100. Luo Y, Chen Z, Mesgarani N, 2018. Speaker-independent speech separation with deep attractor network. IEEE/ACM Trans Audio Speech Lang Process 26. 10.1109/TASLP.2018.2795749
101. Lu T, Liang L, Wang X, 2001. Temporal and rate representations of time-varying signals in the auditory cortex of awake primates. Nat Neurosci 4. 10.1038/nn737
102. Lu T, Wang X, 2004. Information content of auditory cortical responses to time-varying acoustic stimuli. J Neurophysiol 91. 10.1152/jn.00022.2003
103. Machens CK, Wehr MS, Zador AM, 2004. Linearity of cortical receptive fields measured with natural sounds. Journal of Neuroscience 24. 10.1523/JNEUROSCI.4445-03.2004
104. Magnuson JS, You H, Luthra S, Li M, Nam H, Escabí M, Brown K, Allopenna PD, Theodore RM, Monto N, Rueckl JG, 2020. EARSHOT: A minimal neural network model of incremental human speech recognition. Cogn Sci 44. 10.1111/cogs.12823
105. Marr D, 1982. Vision: A computational investigation into the human representation and processing of visual information.
106. Mathews PJ, Jercog PE, Rinzel J, Scott LL, Golding NL, 2010. Control of submillisecond synaptic timing in binaural coincidence detectors by Kv1 channels. Nat Neurosci 13, 601–609. 10.1038/nn.2530
107. McCloy DR, Lau BK, Larson E, Pratt KAI, Lee AKC, 2017. Pupillometry shows the effort of auditory attention switching. J Acoust Soc Am 141, 2440. 10.1121/1.4979340
108. Mesgarani N, David SV, Fritz JB, Shamma SA, 2014. Mechanisms of noise robust representation of speech in primary auditory cortex. Proc Natl Acad Sci U S A 111, 6792–6797. 10.1073/pnas.1318017111
109. Migliore M, Shepherd GM, 2002. Emerging rules for the distributions of active dendritic conductances. Nat Rev Neurosci 3, 362–370. 10.1038/nrn810
110. Miller LM, Escabí MA, Read HL, Schreiner CE, 2002. Spectrotemporal receptive fields in the lemniscal auditory thalamus and cortex. J Neurophysiol 87. 10.1152/jn.00395.2001
111. Młynarski W, McDermott JH, 2018. Learning midlevel auditory codes from natural sound statistics. Neural Comput 30, 631–669. 10.1162/NECO_a_01048
112. Montes-Lourido P, Kar M, David SV, Sadagopan S, 2021. Neuronal selectivity to complex vocalization features emerges in the superficial layers of primary auditory cortex. PLoS Biol 19, e3001299. 10.1371/journal.pbio.3001299
113. Motanis H, Seay MJ, Buonomano DV, 2018. Short-term synaptic plasticity as a mechanism for sensory timing. Trends Neurosci 41, 701–711.
114. Moskovitz T, Roy N, Pillow J, 2018. A comparison of deep learning and linear-nonlinear cascade approaches to neural encoding. bioRxiv 463422. 10.1101/463422
115. Nelson PC, Carney LH, 2004. A phenomenological model of peripheral and central neural responses to amplitude-modulated tones. J Acoust Soc Am 116, 2173. 10.1121/1.1784442
116. Norman-Haignere SV, Kanwisher N, McDermott JH, 2013. Cortical pitch regions in humans respond primarily to resolved harmonics and are located in specific tonotopic regions of anterior auditory cortex. Journal of Neuroscience 33. 10.1523/JNEUROSCI.2880-13.2013
117. Olah VJ, Pedersen NP, Rowan MJM, 2022. Ultrafast simulation of large-scale neocortical microcircuitry with biologically realistic neurons. Elife 11, e79535. 10.7554/eLife.79535
118. Osen KK, 1969. Cytoarchitecture of the cochlear nuclei in the cat. Journal of Comparative Neurology 136. 10.1002/cne.901360407
119. Ozeki H, Finn IM, Schaffer ES, Miller KD, Ferster D, 2009. Inhibitory stabilization of the cortical network underlies visual surround suppression. Neuron 62. 10.1016/j.neuron.2009.03.028
120. Pachitariu M, Lyamzin DR, Sahani M, Lesica NA, 2015. State-dependent population coding in primary auditory cortex. J Neurosci 35, 2058–2073. 10.1523/jneurosci.3318-14.2015
121. Paninski L, Pillow J, Lewi J, 2007. Statistical models for neural encoding, decoding, and optimal stimulus design. Prog Brain Res. 10.1016/S0079-6123(06)65031-0
122. Panzeri S, Macke JH, Gross J, Kayser C, 2015. Neural population coding: combining insights from microscopic and mass signals. Trends Cogn Sci 19, 162–172. 10.1016/j.tics.2015.01.002
123. Parida S, Liu ST, Sadagopan S, 2022. Adaptive mechanisms facilitate robust performance in noise and in reverberation in an auditory categorization model. bioRxiv 2022.09.25.509412. 10.1101/2022.09.25.509412
124. Patterson RD, Uppenkamp S, Johnsrude IS, Griffiths TD, 2002. The processing of temporal pitch and melody information in auditory cortex. Neuron 36. 10.1016/S0896-6273(02)01060-7
125. Patterson R, Nimmo-Smith I, Holdsworth J, Rice P, 1987. An efficient auditory filterbank based on the gammatone function. Cambridge, UK. pdn.cam.ac.uk
126. Penagos H, Melcher JR, Oxenham AJ, 2004. A neural representation of pitch salience in nonprimary human auditory cortex revealed with functional magnetic resonance imaging. Journal of Neuroscience 24. 10.1523/JNEUROSCI.0383-04.2004
127. Pennington JR, David SV, 2022. Can deep learning provide a generalizable model for dynamic sound encoding in auditory cortex? bioRxiv.
128. Pennington JR, David SV, 2020. Complementary effects of adaptation and gain control on sound encoding in primary auditory cortex. eNeuro 7.
129. Perrodin C, Kayser C, Logothetis NK, Petkov CI, 2011. Voice cells in the primate temporal lobe. Current Biology 21, 1408–1415.
130. Peruzzi D, Sivaramakrishnan S, Oliver DL, 2000. Identification of cell types in brain slices of the inferior colliculus. Neuroscience 101. 10.1016/S0306-4522(00)00382-1
131. Petkov CI, Kayser C, Steudel T, Whittingstall K, Augath M, Logothetis NK, 2008. A voice region in the monkey brain. Nat Neurosci 11, 367–374. 10.1038/nn2043
132. Pfingst BE, O’Connor TA, 1981. Characteristics of neurons in auditory cortex of monkeys performing a simple auditory task. J Neurophysiol 45. 10.1152/jn.1981.45.1.16
133. Phillips EAK, Hasenstaub AR, 2016. Asymmetric effects of activating and inactivating cortical interneurons. Elife 5. 10.7554/eLife.18383
134. Phillips EAK, Schreiner CE, Hasenstaub AR, 2017a. Cortical interneurons differentially regulate the effects of acoustic context. Cell Rep 20, 771–778. 10.1016/J.CELREP.2017.07.001
135. Phillips EAK, Schreiner CE, Hasenstaub AR, 2017b. Diverse effects of stimulus history in waking mouse auditory cortex. J Neurophysiol 118, 1376–1393. 10.1152/JN.00094.2017
136. Picou EM, Gordon J, Ricketts TA, 2016. The effects of noise and reverberation on listening effort in adults with normal hearing. Ear Hear 37, 1–13. 10.1097/AUD.0000000000000222
137. Pi HJ, Hangya B, Kvitsiani D, Sanders JI, Huang ZJ, Kepecs A, 2013. Cortical interneurons that specialize in disinhibitory control. Nature 503, 521–524. 10.1038/nature12676
138. Polley DB, Heiser MA, Blake DT, Schreiner CE, Merzenich MM, 2004. Associative learning shapes the neural code for stimulus magnitude in primary auditory cortex. Proc Natl Acad Sci U S A 101. 10.1073/pnas.0407586101
139. Polley DB, Read HL, Storace DA, Merzenich MM, 2007. Multiparametric auditory receptive field organization across five cortical fields in the albino rat. J Neurophysiol. 10.1152/jn.01298.2006
140. Prodi N, Visentin C, 2022. A slight increase in reverberation time in the classroom affects performance and behavioral listening effort. Ear Hear 43, 460–476. 10.1097/AUD.0000000000001110
141. Rabinowitz NC, Willmore BDB, Schnupp JWH, King AJ, 2012. Spectrotemporal contrast kernels for neurons in primary auditory cortex. Journal of Neuroscience 32. 10.1523/JNEUROSCI.1715-12.2012
142. Rabinowitz NC, Willmore BDB, Schnupp JWH, King AJ, 2011. Contrast gain control in auditory cortex. Neuron 70. 10.1016/j.neuron.2011.04.030
143. Rahman M, Willmore BDB, King AJ, Harper NS, 2020. Simple transformations capture auditory input to cortex. Proc Natl Acad Sci U S A 117. 10.1073/pnas.1922033117
144. Rahman M, Willmore BDB, King AJ, Harper NS, 2019. A dynamic network model of temporal receptive fields in primary auditory cortex. PLoS Comput Biol 15. 10.1371/journal.pcbi.1006618
145. Rauschecker JP, Tian B, 2000. Mechanisms and streams for processing of “what” and “where” in auditory cortex. Proc Natl Acad Sci U S A. 10.1073/pnas.97.22.11800
146. Rauschecker JP, Tian B, Hauser M, 1995. Processing of complex sounds in the macaque nonprimary auditory cortex. Science 268. 10.1126/science.7701330
147. Richards BA, Lillicrap TP, Beaudoin P, Bengio Y, Bogacz R, Christensen A, Clopath C, Costa RP, de Berker A, Ganguli S, Gillon CJ, Hafner D, Kepecs A, Kriegeskorte N, Latham P, Lindsay GW, Miller KD, Naud R, Pack CC, Poirazi P, Roelfsema P, Sacramento J, Saxe A, Scellier B, Schapiro AC, Senn W, Wayne G, Yamins D, Zenke F, Zylberberg J, Therien D, Kording KP, 2019. A deep learning framework for neuroscience. Nat Neurosci 22, 1761–1770. 10.1038/s41593-019-0520-2
148. Romanski LM, Goldman-Rakic PS, 2002. An auditory domain in primate prefrontal cortex. Nat Neurosci 5. 10.1038/nn781
149. Rowekamp RJ, Sharpee TO, 2011. Analyzing multicomponent receptive fields from neural responses to natural stimuli. Network: Computation in Neural Systems. 10.3109/0954898X.2011.566303
150. Runyan CA, Piasini E, Panzeri S, Harvey CD, 2017. Distinct timescales of population coding across cortex. Nature 548. 10.1038/nature23020
151. Sadagopan S, Temiz-Karayol NZ, Voss HU, 2015. High-field functional magnetic resonance imaging of vocalization processing in marmosets. Sci Rep 5, 10950.
152. Sadagopan S, Wang X, 2010. Contribution of inhibition to stimulus selectivity in primary auditory cortex of awake primates. Journal of Neuroscience 30, 7314–7325.
153. Sadagopan S, Wang X, 2009. Nonlinear spectrotemporal interactions underlying selectivity for complex sounds in auditory cortex. Journal of Neuroscience 29, 11192–11202. 10.1523/JNEUROSCI.1286-09.2009
154. Sadagopan S, Wang X, 2008. Level invariant representation of sounds by populations of neurons in primary auditory cortex. Journal of Neuroscience 28, 3415–3426.
155. Saddler MR, Gonzalez R, McDermott JH, 2021. Deep neural network models reveal interplay of peripheral coding and stimulus statistics in pitch perception. Nat Commun 12, 1–25. 10.1038/s41467-021-27366-6
156. Saderi D, Schwartz ZP, Heller CR, Pennington JR, David SV, 2021. Dissociation of task engagement and arousal effects in auditory cortex and midbrain. Elife 10, 1–25. 10.7554/eLife.60153
157. Sahani M, Linden JF, 2003. How linear are auditory cortical responses? in: Advances in Neural Information Processing Systems.
158. Sainath TN, Kanevsky D, Iyengar G, 2007. Unsupervised audio segmentation using extended Baum-Welch transformations, in: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. 10.1109/ICASSP.2007.366653
159. Saremi A, Beutelmann R, Dietz M, Ashida G, Kretzberg J, Verhulst S, 2016. A comparative study of seven human cochlear filter models. J Acoust Soc Am 140, 1618. 10.1121/1.4960486
160. Scheidiger C, Carney LH, Dau T, Zaar J, 2018. Predicting speech intelligibility based on across-frequency contrast in simulated auditory-nerve fluctuations. Acta Acustica united with Acustica 104, 914–917. 10.3813/AAA.919245
161. Schnupp JWH, Mrsic-Flogel TD, King AJ, 2001. Linear processing of spatial cues in primary auditory cortex. Nature 414. 10.1038/35102568
162. Schrimpf M, Kubilius J, Hong H, Majaj NJ, Rajalingham R, Issa EB, Kar K, Bashivan P, Prescott-Roy J, Geiger F, Schmidt K, Yamins DLK, DiCarlo JJ, 2020. Brain-Score: which artificial neural network for object recognition is most brain-like? bioRxiv 407007. 10.1101/407007
163. Schwartz O, Pillow JW, Rust NC, Simoncelli EP, 2006. Spike-triggered neural characterization. J Vis 6. 10.1167/6.4.13
164. Seay MJ, Natan RG, Geffen MN, Buonomano DV, 2020. Differential short-term plasticity of PV and SST neurons accounts for adaptation and facilitation of cortical neurons to auditory tones. Journal of Neuroscience 40. 10.1523/JNEUROSCI.0686-20.2020
165. Seybold BA, Phillips EAK, Schreiner CE, Hasenstaub AR, 2015. Inhibitory actions unified by network integration. Neuron 87. 10.1016/j.neuron.2015.09.013
166. Sharpee TO, 2013. Computational identification of receptive fields. Annu Rev Neurosci. 10.1146/annurev-neuro-062012-170253
167. Sharpee TO, Atencio CA, Schreiner CE, 2011. Hierarchical representations in the auditory cortex. Curr Opin Neurobiol 21, 761–767. 10.1016/J.CONB.2011.05.027
168. Sharpee T, Rust NC, Bialek W, 2004. Analyzing neural responses to natural signals: maximally informative dimensions. Neural Comput 16. 10.1162/089976604322742010
169. Simpson AJR, Roma G, Plumbley MD, 2015. Deep karaoke: extracting vocals from musical mixtures using a convolutional deep neural network, in: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 10.1007/978-3-319-22482-4_50
170. Smith EC, Lewicki MS, 2006. Efficient auditory coding. Nature 439, 978–982. 10.1038/nature04485
171. Stefanini F, Kushnir L, Jimenez JC, Jennings JH, Woods NI, Stuber GD, Kheirbek MA, Hen R, Fusi S, 2020. A distributed neural code in the dentate gyrus and in CA1. Neuron 107, 703–716.e4. 10.1016/J.NEURON.2020.05.022
172. Stevenson IH, London BM, Oby ER, Sachs NA, Reimer J, Englitz B, David SV, Shamma SA, Blanche TJ, Mizuseki K, Zandvakili A, Hatsopoulos NG, Miller LE, Kording KP, 2012. Functional connectivity and tuning curves in populations of simultaneously recorded neurons. PLoS Comput Biol 8. 10.1371/journal.pcbi.1002775
173. Tallal P, 1994. In the perception of speech time is of the essence. 10.1007/978-3-642-85148-3_16
174. Tan X, Young H, Matic AI, Zirkle W, Rajguru S, Richter CP, 2015. Temporal properties of inferior colliculus neurons to photonic stimulation in the cochlea. Physiol Rep 3. 10.14814/phy2.12491
175. Tang C, Hamilton LS, Chang EF, 2017. Intonational speech prosody encoding in the human auditory cortex. Science 357. 10.1126/science.aam8577
176. Theunissen FE, Sen K, Doupe AJ, 2000. Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds. Journal of Neuroscience 20. 10.1523/jneurosci.20-06-02315.2000
177. Thomas S, Suzuki M, Huang Y, Kurata G, Tuske Z, et al., 2019. English broadcast news speech recognition by humans and machines. arXiv:1904.13258. 10.48550/arXiv.1904.13258
178. Tian B, Reser D, Durham A, Kustov A, Rauschecker JP, 2001. Functional specialization in rhesus monkey auditory cortex. Science 292, 290–293. 10.1126/science.1058911
179. Tsao DY, Freiwald WA, Knutsen TA, Mandeville JB, Tootell RBH, 2003. Faces and objects in macaque cerebral cortex. Nat Neurosci 6. 10.1038/nn1111
180. Tsao DY, Livingstone MS, 2008. Mechanisms of face perception. Annu Rev Neurosci. 10.1146/annurev.neuro.30.051606.094238
181. Tsodyks MV, Skaggs WE, Sejnowski TJ, McNaughton BL, 1997. Paradoxical effects of external modulation of inhibitory interneurons. Journal of Neuroscience 17. 10.1523/jneurosci.17-11-04382.1997
182. Ullman S, Assif L, Fetaya E, Harari D, 2016. Atoms of recognition in human and computer vision. Proc Natl Acad Sci U S A 113. 10.1073/pnas.1513198113
183. Ullman S, Vidal-Naquet M, Sali E, 2002. Visual features of intermediate complexity and their use in classification. Nat Neurosci 5. 10.1038/nn870
184. Wang X-J, Hu H, Huang C, Kennedy H, Li CT, Logothetis N, Lu Z-L, Luo Q, Poo M, Tsao D, Wu S, Wu Z, Zhang X, Zhou D, 2020. Computational neuroscience: a frontier of the 21st century. Natl Sci Rev 7, 1418–1422. 10.1093/NSR/NWAA129
185. Weerts L, Rosen S, Clopath C, Goodman DFM, 2022. The psychometrics of automatic speech recognition. bioRxiv 2021.04.19.440438. 10.1101/2021.04.19.440438
186. Wehr M, Zador AM, 2005. Synaptic mechanisms of forward suppression in rat auditory cortex. Neuron 47. 10.1016/j.neuron.2005.06.009
187. Williamson RS, Ahrens MB, Linden JF, Sahani M, 2016. Input-specific gain modulation by local sensory context shapes cortical and thalamic responses to complex sounds. Neuron 91. 10.1016/j.neuron.2016.05.041
188. Williamson RS, Sahani M, Pillow JW, 2015. The equivalence of information-theoretic and likelihood-based methods for neural dimensionality reduction. PLoS Comput Biol 11. 10.1371/journal.pcbi.1004141
189. Willmore BDB, Cooke JE, King AJ, 2014. Hearing in noisy environments: noise invariance and contrast gain control. Journal of Physiology 592. 10.1113/jphysiol.2014.274886
190. Willmore BDB, Schoppe O, King AJ, Schnupp JWH, Harper NS, 2016. Incorporating midbrain adaptation to mean sound level improves models of auditory cortical processing. Journal of Neuroscience 36. 10.1523/JNEUROSCI.2441-15.2016
191. Wilson NR, Runyan CA, Wang FL, Sur M, 2012. Division and subtraction by distinct cortical inhibitory networks in vivo. Nature 488, 343–348. 10.1038/nature11347
192. Winer JA, 1984. The human medial geniculate body. Hear Res 15. 10.1016/0378-5955(84)90031-5
193. Winer JA, Morest DK, 1983. The medial division of the medial geniculate body of the cat: implications for thalamic organization. Journal of Neuroscience 3. 10.1523/jneurosci.03-12-02629.1983
194. Winkler I, Denham SL, Nelken I, 2009. Modeling the auditory scene: predictive regularity representations and perceptual objects. Trends Cogn Sci 13, 532–540. 10.1016/J.TICS.2009.09.003
195. Woolley SMN, Fremouw TE, Hsu A, Theunissen FE, 2005. Tuning for spectro-temporal modulations as a mechanism for auditory discrimination of natural sounds. Nat Neurosci 8. 10.1038/nn1536
196. Xu Y, Huang Q, Wang W, Foster P, Sigtia S, Jackson PJB, Plumbley MD, 2017. Unsupervised feature learning based on deep models for environmental audio tagging. IEEE/ACM Trans Audio Speech Lang Process 25. 10.1109/TASLP.2017.2690563
197. Yarden TS, Nelken I, 2017. Stimulus-specific adaptation in a recurrent network model of primary auditory cortex. PLoS Comput Biol 13, e1005437. 10.1371/JOURNAL.PCBI.1005437
198. Zilany MSA, Bruce IC, Carney LH, 2014. Updated parameters and expanded simulation options for a model of the auditory periphery. J Acoust Soc Am 135. 10.1121/1.4837815
199. Zilany MSA, Bruce IC, Nelson PC, Carney LH, 2009. A phenomenological model of the synapse between the inner hair cell and auditory nerve: long-term adaptation with power-law dynamics. J Acoust Soc Am 126, 2390. 10.1121/1.3238250
