(a): To examine the linguistic nature of Whisper’s EEG predictions, we referenced them to a pure language model (GPT-2-medium). We focused on GPT-2 L16 based on independent fMRI research ([26]; see also Fig J in S1 Text for validation of this choice). Consistent with EEG reflecting traces of lexical processing, we found that late linguistic Whisper layers captured all of the variance predicted by GPT-2 (and more), because the Union model (with GPT-2 and Whisper) was no more accurate than Whisper L5 or L6 alone. In contrast, earlier speech-like layers were complemented by GPT-2. (b): To examine whether Whisper’s accurate EEG predictions were driven by contextualized representation, Whisper’s context window size was constrained to different durations [0.5s, 1s, 5s, 10s, 20s, 30s]. Accuracy was greatest at 5-10s, suggesting that intermediate contexts spanning multiple words were beneficial. Corresponding signed-rank test Z and FDR-corrected p-values are displayed on the plot. The dashed horizontal line reflects mean prediction accuracy with a 0.5s context. (c): To examine whether Whisper L6’s accurate EEG predictions were in part driven by sub-lexical structure residual in Whisper’s last layer, we disrupted within-word structure by either feature-wise averaging L6 vectors within word time boundaries or randomly reordering Whisper vectors within words. The outcome suggested that the EEG data additionally reflected a sub-lexical transformational stage, because either shuffling vectors within words or averaging them compromised EEG prediction in most participants. (d): To explore how the relative timing of EEG responses predicted by Whisper compared to the speech envelope and language model, we ran a battery of “single time lag” regression analyses. Model features were offset by a single lag within the range [0 to 750ms in 1/32s steps] and model-to-EEG mappings were fit on each lag separately (rather than on all lags together). This was repeated for each model in isolation. 
Whisper preferentially predicted a lag of 63ms, after the speech envelope (31ms) and before both the language model (125ms) and word surprisal (406ms). Note that the illustrated profiles chart (single-lag) prediction accuracies, and as such should not be confused with time-lagged regression beta-coefficients commonly used in the literature to estimate brain temporal response functions.
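The single-lag procedure in (d) can be illustrated with a minimal sketch on synthetic data. This is not the analysis code used for the figure: the 32 Hz sampling rate (inferred from the 1/32s lag step), the closed-form ridge regression, the single 80/20 train/test split, and the helper names `delay`, `ridge_fit` and `single_lag_profile` are all assumptions made for illustration. The sketch fits one mapping per lag and records held-out prediction accuracy (Pearson r), so that the resulting profile charts accuracy per lag, not beta-coefficients.

```python
import numpy as np

FS = 32  # Hz; assumed sampling rate, so that one sample equals one 1/32 s lag step


def delay(X, lag):
    """Delay the feature time series by `lag` samples, zero-padding the start."""
    if lag == 0:
        return X.copy()
    out = np.zeros_like(X)
    out[lag:] = X[:-lag]
    return out


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression weights (no intercept; alpha is an assumed value)."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ y)


def single_lag_profile(X, y, max_lag, alpha=1.0, train_frac=0.8):
    """Fit one model per lag and return held-out Pearson r for each lag."""
    split = int(train_frac * len(y))
    scores = []
    for lag in range(max_lag + 1):
        X_lag = delay(X, lag)                      # features offset by this lag only
        w = ridge_fit(X_lag[:split], y[:split], alpha)
        pred = X_lag[split:] @ w                   # predict the held-out signal
        scores.append(np.corrcoef(pred, y[split:])[0, 1])
    return np.array(scores)


# Synthetic demonstration: a simulated "EEG" channel that follows the
# features with a known delay of 4 samples (125 ms at 32 Hz), plus noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((4000, 5))
true_lag = 4
y = delay(X, true_lag) @ rng.standard_normal(5) + 0.5 * rng.standard_normal(4000)

profile = single_lag_profile(X, y, max_lag=24)     # 24 samples = 750 ms
best_lag = int(np.argmax(profile))
best_lag_ms = 1000 * best_lag / FS
```

In this toy setting the accuracy profile peaks at the simulated delay. With real model features, which are temporally autocorrelated, the profile would be smoother and the peak broader, so the preferred lag should be read as the maximum of a curve and not as a sharp spike.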