A. The LN model consists of a single convolutional filter (L) followed by a static nonlinearity (N). The convolution is one-dimensional: a separate temporal filter is convolved with each frequency channel, and the results are then summed across frequency. Arrows indicate passthrough between units within a layer. B. The population LN (pop-LN) model is composed of a bank of temporal convolutional units (purple shading) followed, for each neuron, by one dense unit (D) and one static nonlinearity, where "dense" refers to a linear weighting of the outputs of the previous layer. C. The single-neuron convolutional neural network (single-CNN), or LNLN cascade [24], consists of a bank of LN units (red shading) whose outputs are linearly combined in a subsequent dense unit and passed through another static nonlinearity. D-F. Population CNN models consist of one or more convolutional layers followed by two dense, fully connected layers. Dense units in the final layer generate the output for each neuron. Convolutional units are either one-dimensional (“1D”, convolved in time and summed over frequency, derived from standard spectro-temporal models for auditory processing) or two-dimensional (“2D”, convolved in both time and frequency, derived from standard CNN models for visual processing). The 1D models have either one convolutional layer (1D-CNN, D, dark green shading) or two (1Dx2-CNN, E, blue shading), while the 2D-CNN model has three convolutional layers in sequence (F, light green shading).
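As a rough illustration of panels A and B, the 1D convolution described above (one temporal filter per frequency channel, summed across frequency) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation used for the models in the figure: the function names (`conv1d_unit`, `ln_model`, `pop_ln_model`), the array shapes, the causal alignment of the filter, and the use of a rectifier as the static nonlinearity are all illustrative choices, and bias terms are omitted.

```python
import numpy as np


def relu(x):
    # Assumed static nonlinearity; the caption does not specify its form.
    return np.maximum(x, 0.0)


def conv1d_unit(spectrogram, filt):
    """One 1D convolutional unit.

    spectrogram: (n_freq, n_time) stimulus
    filt:        (n_freq, n_lags) one temporal filter per frequency channel
    Returns a (n_time,) signal: each channel is convolved in time with its
    own filter (causally, so time t depends only on t, t-1, ...), and the
    per-channel results are summed across frequency.
    """
    n_time = spectrogram.shape[1]
    out = np.zeros(n_time)
    for s_f, h_f in zip(spectrogram, filt):
        out += np.convolve(s_f, h_f, mode="full")[:n_time]
    return out


def ln_model(spectrogram, filt, nonlinearity=relu):
    # Panel A: linear filter (L) followed by a static nonlinearity (N).
    return nonlinearity(conv1d_unit(spectrogram, filt))


def pop_ln_model(spectrogram, filters, dense_weights, nonlinearity=relu):
    """Panel B: shared bank of convolutional units, then one dense unit
    and one static nonlinearity per neuron.

    filters:       (n_units, n_freq, n_lags)
    dense_weights: (n_neurons, n_units)
    Returns (n_neurons, n_time) predicted responses.
    """
    bank = np.stack([conv1d_unit(spectrogram, h) for h in filters])
    return nonlinearity(dense_weights @ bank)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stim = rng.random((18, 200))             # hypothetical 18-channel spectrogram
    filters = rng.normal(size=(4, 18, 15))   # 4 units, 15 time lags each
    weights = rng.normal(size=(10, 4))       # 10 neurons
    print(ln_model(stim, filters[0]).shape)
    print(pop_ln_model(stim, filters, weights).shape)
```

In this sketch the only difference between panels A and B is that the pop-LN model shares its convolutional bank across all neurons, with each neuron contributing just one row of `dense_weights`.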