Author manuscript; available in PMC: 2026 Mar 10.
Published in final edited form as: Curr Biol. 2025 Feb 6;35(5):1023–1032.e6. doi: 10.1016/j.cub.2025.01.024

Semantic language decoding across participants and stimulus modalities

Jerry Tang 1, Alexander G Huth 1,2,3
PMCID: PMC11903136  NIHMSID: NIHMS2049997  PMID: 39919742

Summary

Brain decoders that reconstruct language from semantic representations have the potential to improve communication for people with impaired language production. However, training a semantic decoder for a participant currently requires many hours of brain responses to linguistic stimuli, and people with impaired language production often also have impaired language comprehension. In this study, we tested whether language can be decoded from a goal participant without using any linguistic training data from that participant. We trained semantic decoders on brain responses from separate reference participants, and then used functional alignment to transfer the decoders to the goal participant. Cross-participant decoder predictions were semantically related to the stimulus words, even when functional alignment was performed using movies with no linguistic content. To assess how much semantic representations are shared between language and vision, we compared functional alignment accuracy using story and movie stimuli, and found that performance was comparable in most cortical regions. Finally, we tested whether cross-participant decoders could be robust to lesions by excluding brain regions from the goal participant prior to functional alignment, and found that cross-participant decoders do not depend on data from any single brain region. These results demonstrate that cross-participant decoding can reduce the amount of linguistic training data required from a goal participant, and potentially enable language decoding from participants who struggle with both language production and language comprehension.

eTOC Blurb

Tang and Huth demonstrate that semantic decoders can be transferred across participants using functional alignment, and that this alignment can be performed using responses to either stories or movies. Their findings indicate the potential for decoding semantic information from participants with impaired language comprehension.

Introduction

To produce language, speakers need to map between semantic representations of concepts, lexical representations of word forms, and motor representations of intended movements. Damage to any stage of this process can impair language production1. An emerging way to help people with impaired language production is to predict what they want to say by decoding their brain activity2. Several studies have decoded intended speech from motor representations to help people with dysarthria, who struggle to map motor representations into muscle movements3-5. However, many people with higher-level language disorders such as aphasia struggle with earlier stages of language production—mapping semantic representations into lexical or motor representations—making existing speech decoders unsuitable6-9. A promising alternative for people with aphasia is to decode intended meaning from semantic representations10,11. However, training a semantic decoder for a goal participant currently requires collecting many hours of brain responses to linguistic stimuli. This can be challenging for people with aphasia, who often have impaired language comprehension in addition to impaired language production12.

A potential alternative is to train decoders on data from separate reference participants, and then use them to decode responses from the goal participant. This can substantially reduce the amount of training data required from the goal participant. Since the anatomical structure and functional organization of the brain vary across individuals, transferring a reference participant decoder to a goal participant requires aligning the brains of the participants. This alignment can be performed using anatomical landmarks13,14 or functional responses to shared stimuli15,16. While anatomical approaches have been unsuccessful at transferring decoders across participants11,17, functional approaches have been successfully used to transfer vision17-20 and speech21 decoders. Furthermore, previous studies have demonstrated that semantic representations can be aligned across participants, indicating the potential for transferring semantic decoders16,22.

Functional alignment is typically performed by presenting a shared set of stimuli to the goal and reference participants, and then modeling correspondences between the recorded brain responses. Previous cross-participant decoding studies have trained decoders and aligned brain responses using stimuli of the same modality. For instance, vision decoders have been trained on responses to images, and then transferred across participants using responses to other images17,23. However, in order to transfer language decoders to participants with impaired language comprehension, it may be necessary to align brain responses using non-linguistic stimuli. Unlike previous speech decoders, semantic decoders target representations that can be activated by either linguistic or non-linguistic stimuli10,11. In particular, several studies indicate that semantic representations are shared between language and vision24-29. An intriguing possibility is that functional alignment can be performed using non-linguistic visual stimuli, and then used to transfer decoders trained on linguistic stimuli. This would enable language decoding without requiring any linguistic training data from the goal participant.

In this study, we assessed how well semantic decoders can be transferred across participants and stimulus modalities. For functional alignment, we adapted a cross-participant converter approach that was previously used to transfer vision decoders17,20,23,30. In this approach, linear converter models are trained to predict the activity in each reference participant voxel using the activity in a population of goal participant voxels (Figure 1A). To test our hypothesis that semantic decoders can be transferred using non-linguistic stimuli, we trained a converter on brain responses to silent movies. To provide a ceiling for this cross-modality transfer, we also trained a converter on brain responses to narrative stories. Separately, we trained a semantic decoder11 for the reference participant using brain responses to a large set of narrative stories (Figure 1B). The decoder takes reference participant responses and reconstructs the words that the participant heard (see STAR Methods). To decode new goal participant responses, we first used a cross-participant converter to predict reference participant responses, and then applied the reference participant decoder to the predicted responses (Figure 1C).
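
The converter step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes ridge regression with a single hand-picked penalty `alpha`, uses every goal participant voxel as a predictor for every reference participant voxel, and runs on synthetic data. The function names `fit_converter` and `apply_converter` are hypothetical.

```python
import numpy as np

def fit_converter(goal, ref, alpha=1.0):
    """Fit a linear map from goal responses (time x goal voxels)
    to reference responses (time x reference voxels) with ridge regression."""
    n_goal = goal.shape[1]
    # Closed-form ridge solution: W = (X'X + alpha * I)^-1 X'Y
    gram = goal.T @ goal + alpha * np.eye(n_goal)
    return np.linalg.solve(gram, goal.T @ ref)

def apply_converter(weights, goal_new):
    """Predict reference participant responses from new goal responses."""
    return goal_new @ weights

# Synthetic stand-in for responses to a shared stimulus (not real fMRI data).
rng = np.random.default_rng(0)
true_map = rng.normal(size=(20, 15))
goal_train = rng.normal(size=(200, 20))
ref_train = goal_train @ true_map + 0.1 * rng.normal(size=(200, 15))

weights = fit_converter(goal_train, ref_train, alpha=1.0)

# Held-out goal responses are converted into predicted reference responses,
# which is the input a reference participant decoder would receive.
goal_test = rng.normal(size=(50, 20))
ref_pred = apply_converter(weights, goal_test)
ref_true = goal_test @ true_map
```

In practice the penalty would be chosen by cross-validation and the set of predictor voxels could be restricted; neither choice is specified in this section.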

Figure 1. Cross-participant decoding.

(A) Brain responses to shared sets of narrative stories and silent movies were recorded from a goal participant and a reference participant (movie frame from Blender Foundation; https://www.sintel.org). Cross-participant converters were trained to take goal participant responses and predict how the reference participant would respond to the same stimulus. (B) Brain responses to a large set of narrative stories were recorded from the reference participant. A semantic decoder was trained to take reference participant responses and predict the words that the participant heard. (C) To decode new goal participant responses, the cross-participant converter was used to predict reference participant responses, and the reference participant decoder was applied to the predicted responses.

Results

Cross-participant decoding performance

We performed cross-participant decoding on functional magnetic resonance imaging (fMRI) responses from three neurologically healthy participants. We treated each participant as the goal participant and the other two participants as reference participants. For each pair of goal and reference participants, we trained a movie-based converter on brain responses to 70 minutes of silent movies, and a story-based converter on brain responses to 70 minutes of narrative stories (see STAR Methods). The naturalistic stimuli provide rich semantic coverage, enabling the converters to align representations of thousands of concepts16,31-33. For each reference participant, we trained a semantic decoder on a larger set of brain responses to 10 hours of narrative stories11,34. We evaluated cross-participant decoding on single-trial brain responses to a 10 minute test story that was not used in converter or decoder training. Since the decoders target semantic representations, we quantified decoding performance by comparing the predicted and actual story words using the BERTScore metric, which measures similarity of meaning by computing a matching score over contextual word embeddings35.
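
The matching score behind BERTScore can be illustrated with a short sketch. It assumes token embeddings are already available as arrays (here random toy vectors rather than contextual embeddings from a pretrained language model) and omits optional idf weighting. The function name `greedy_match_scores` is hypothetical, and this section does not state which of the three scores was used.

```python
import numpy as np

def greedy_match_scores(cand_emb, ref_emb):
    """Return (precision, recall, f1) from greedy cosine matching.

    cand_emb: (n_candidate_tokens, dim) embeddings of the predicted words.
    ref_emb:  (n_reference_tokens, dim) embeddings of the actual words.
    """
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T  # cosine similarity for every token pair
    precision = sim.max(axis=1).mean()  # best match for each candidate token
    recall = sim.max(axis=0).mean()     # best match for each reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy embeddings (assumption: stand-ins for contextual word embeddings).
rng = np.random.default_rng(0)
ref_emb = rng.normal(size=(6, 8))
same = greedy_match_scores(ref_emb[::-1], ref_emb)       # same tokens, reordered
other = greedy_match_scores(rng.normal(size=(5, 8)), ref_emb)  # unrelated tokens
```

Because each token is matched to its most similar counterpart, reordering the same tokens leaves the toy score unchanged, while unrelated tokens score lower.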

Previous studies have found that combining data from multiple reference participants can improve vision and speech decoding17,18,21. To test whether this applies to semantic decoding, we separately aligned brain responses between the goal participant and the two reference participants. Given new goal participant responses, we decoded each word by ensembling predictions from the two reference participant decoders (see STAR Methods). Ensembling can improve cross-participant decoding performance by reducing the effects of random noise and idiosyncratic responses. We found that cross-participant decoding using both reference participant decoders outperformed cross-participant decoding using a single reference participant decoder (Figure 2A). As a result, we ensembled predictions from the two reference participant decoders for all of the following analyses. In the future, ensembling predictions across more reference participants will likely further improve decoding performance (Table S1).

Figure 2. Cross-participant decoding performance.

(A) Cross-participant decoder predictions were compared to the actual stimulus words using the BERTScore metric, which quantifies similarity of meaning. Decoding scores are shown as standard deviations away from the mean of a null distribution, which was obtained by generating null sequences from a language model without using any brain data, and then comparing the null sequences to the actual stimulus words using BERTScore (see STAR Methods). Ensembled predictions from two reference participant decoders significantly outperformed predictions from a single reference participant decoder (* indicates q(FDR) < 0.05, one-sided paired t-test across n = 3 goal participants; see Table S1 for an extended analysis). (B) Cross-participant decoders were compared to a within-participant decoder trained on 10 hours of story data from the goal participant. Overall cross-participant decoding performance was around half of within-participant decoding performance (14.63-18.75 σ) for both story-based (8.18-10.18 σ) and movie-based (7.05-8.02 σ) converters. The time-course of cross-participant decoding performance was significantly correlated with the time-course of within-participant decoding performance for both story-based (r = 0.29-0.58, q(FDR) < 0.05) and movie-based (r = 0.32-0.46, q(FDR) < 0.05) converters. (C) Three segments from the test story are shown alongside decoder predictions for one participant (see Table S2 for full transcripts). While cross-participant decoders were less precise and less consistent than the within-participant ceiling, the predictions were still semantically related to the stimulus words (see Figure S1 for an identification analysis). Qualitatively, the story-based and movie-based converters performed comparably well, indicating that stimuli of either modality can be used to transfer semantic decoders across participants. (D) Cross-participant and within-participant decoders were trained on subsets of story data from the goal participant. 
Decoding scores were averaged across goal participants, and markers indicate different subsets of story data. Decoding scores appeared to increase by a constant amount every time the amount of story data was doubled. Cross-participant decoders outperformed within-participant decoders when trained on less than 2 hours of story data from the goal participant. (E) Cross-participant and within-participant decoders were trained on subsets of movie data from the goal participant. Decoding scores were averaged across goal participants, and markers indicate different subsets of movie data. Decoding scores appeared to increase by a constant amount every time the amount of movie data was doubled. Cross-participant decoders outperformed within-participant decoders when trained on any amount of movie data from the goal participant.

To assess whether semantic decoders can be transferred across participants using non-linguistic stimuli, we compared cross-participant decoding performance using the story-based and movie-based converters. We obtained a ceiling for decoding performance by comparing the cross-participant decoders to within-participant decoders trained on 10 hours of story data from the goal participant11. Across the test story, cross-participant decoding performance was around half of the within-participant ceiling (14.63-18.75 σ above chance) for both the story-based (8.18-10.18 σ) and movie-based (7.05-8.02 σ) converters. Across timepoints, cross-participant decoding performance was significantly correlated with within-participant decoding performance for both the story-based (linear correlation r = 0.29-0.58, q(FDR) < 0.05) and movie-based (r = 0.32-0.46, q(FDR) < 0.05) converters (Figure 2B). We have previously found that within-participant decoding performance fluctuates with the semantics of the test story11, and these results indicate that cross-participant decoders may have similar biases. Finally, we used BERTScore similarities between the predicted and actual story words to identify the part of the test story that each part of the decoder prediction corresponds to (see STAR Methods). Identification performance was significantly higher than expected by chance for both the story-based (mean percentile rank = 0.72–0.76, q(FDR) < 0.05) and movie-based (mean percentile rank = 0.66–0.73, q(FDR) < 0.05) converters (Figure S1).
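
Decoding scores reported in σ express how far a score lies from the mean of a null distribution, in units of that distribution's standard deviation. The sketch below shows the arithmetic on made-up numbers. The function name `sigma_above_null` is hypothetical, and the use of the population standard deviation (NumPy's default) is an assumption, since this section does not specify the estimator.

```python
import numpy as np

def sigma_above_null(score, null_scores):
    """Express a decoding score as standard deviations above the null mean."""
    null_scores = np.asarray(null_scores, dtype=float)
    # Assumption: population standard deviation (ddof=0).
    return (score - null_scores.mean()) / null_scores.std()

# Made-up similarity scores for null sequences generated without brain data.
null_scores = [0.80, 0.81, 0.79, 0.80, 0.82, 0.78]
z = sigma_above_null(0.84, null_scores)
```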

Qualitatively, the cross-participant decoder predictions were semantically related to the stimulus words, although they were less accurate than the within-participant decoder predictions (Figure 2C, Table S2). Notably, the story-based and movie-based converters led to comparable performance, even though both were used to decode responses to stories. This indicates that semantic decoders can be transferred across participants solely using responses to non-linguistic stimuli.

To identify when cross-participant decoding can provide practical benefits, we trained cross-participant and within-participant decoders on varying amounts of data from the goal participant (see STAR Methods). We first trained within-participant and cross-participant decoders on randomly sampled subsets of up to 128 minutes of story data. Consistent with previous findings11, decoding performance appeared to increase by a constant amount every time the amount of story data from the goal participant was doubled (Figure 2D). While cross-participant decoders had a lower slope than within-participant decoders (m_cross = 1.60, m_within = 2.63), they had a higher intercept (b_cross = 0.93, b_within = −7.10), and outperformed within-participant decoders when trained on less than 2 hours of story data from the goal participant. This indicates that cross-participant decoding can provide some utility even for participants with relatively spared language comprehension. While this utility is currently limited, the relative advantage of cross-participant decoding may increase with the number of reference participants and the amount of data collected from each reference participant17,18,21.

We then trained within-participant and cross-participant decoders on randomly sampled subsets of up to 64 minutes of movie data. In order to train within-participant language decoders on responses to movies, we represented the stimulus movies in natural language by transcribing official audio descriptions, which were designed to capture the main events of the movies. We then performed the same decoder training procedure using these transcripts as the stimulus words (see STAR Methods). Decoding performance appeared to increase by a constant amount every time the amount of movie data from the goal participant was doubled (Figure 2E). Here, cross-participant decoders had a higher slope (m_cross = 1.12, m_within = 1.11) and intercept (b_cross = 1.00, b_within = −2.40) than within-participant decoders, and outperformed within-participant decoders when trained on any amount of movie data from the goal participant. This indicates that cross-participant decoding is the most effective approach for decoding language from participants with impaired language comprehension.

Cross-participant alignment accuracy

While the previous results demonstrate that semantic decoders can be transferred across participants and stimulus modalities, cross-participant decoding performance was still substantially lower than the within-participant ceiling. One factor that could limit the effectiveness of cross-participant decoding is how well the converters align semantic representations from one participant with those from another.

To better understand how effective the converters are, we evaluated the converters on brain responses to five repeats of the 10 minute test story. We used the converters to predict the reference participant responses based on the goal participant responses. We quantified converter performance using the linear correlation between the predicted and actual response time-courses in each reference participant voxel. To account for the low signal-to-noise ratio of fMRI recordings, we normalized converter performance in each voxel by an inter-trial noise ceiling that estimates the amount of explainable variance in the voxel’s responses36 (see STAR Methods; non-normalized correlations are shown in Figure S2). Converter performance captures how much stimulus-driven signal is shared across participants. If a reference participant voxel has high converter performance, it indicates that the signals in that voxel are related to the signals in a population of goal participant voxels.
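
The converter performance measure can be sketched as a voxelwise correlation followed by division by a noise ceiling. This is an illustration under stated assumptions: the noise ceiling is taken as a given per-voxel input rather than estimated from the repeats (the estimation procedure is described in the cited reference, not here), and the `min_ceiling` cutoff that masks unreliable voxels is a hypothetical safeguard, as are the function names.

```python
import numpy as np

def voxelwise_correlation(pred, actual):
    """Pearson correlation between matching columns of two (time x voxels) arrays."""
    pred_z = (pred - pred.mean(axis=0)) / pred.std(axis=0)
    actual_z = (actual - actual.mean(axis=0)) / actual.std(axis=0)
    return (pred_z * actual_z).mean(axis=0)

def normalize_by_ceiling(corr, ceiling, min_ceiling=0.1):
    """Divide each voxel's correlation by its noise ceiling.

    Voxels whose ceiling is below min_ceiling (hypothetical cutoff) become NaN,
    since dividing by a near-zero ceiling would inflate noise.
    """
    corr = np.asarray(corr, dtype=float)
    ceiling = np.asarray(ceiling, dtype=float)
    out = np.full(corr.shape, np.nan)
    ok = ceiling >= min_ceiling
    out[ok] = corr[ok] / ceiling[ok]
    return out

# Synthetic check: one perfectly predicted voxel and one sign-flipped voxel.
rng = np.random.default_rng(0)
actual = rng.normal(size=(100, 2))
pred = np.column_stack([actual[:, 0], -actual[:, 1]])
corr = voxelwise_correlation(pred, actual)
normalized = normalize_by_ceiling([0.3, 0.2], [0.6, 0.05])
```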

We found that the story-based (Figure 3A, Figure S3A) and movie-based (Figure 3B, Figure S3B) converters could accurately predict responses in many regions of frontal, parietal, and temporal cortex. Movie-based converter performance approached story-based converter performance in most cortical regions, even though both were evaluated on responses to stories. This supports previous findings that semantic representations are shared between language and vision24,25,28,29.

Figure 3. Cross-participant alignment accuracy.

(A) Story-based converters were trained on brain responses to narrative stories and evaluated on brain responses to five repeats of a test story. Converter performance was quantified for each reference participant voxel using the linear correlation between the predicted and actual response time-courses. Converter performance was normalized using an inter-trial noise ceiling (see Figure S2 for non-normalized correlations). Story-based converters had high performance in many cortical regions that encode semantic, acoustic, and articulatory representations (see Figure S3A for other participants). (B) Movie-based converters were trained on brain responses to silent movies and evaluated on brain responses to five repeats of a test story. Movie-based converters had high performance in many cortical regions that encode semantic representations, but low performance in cortical regions that encode acoustic and articulatory representations (see Figure S3B for other participants). (C) Voxels were partitioned into four anatomical regions and five functional networks. Converter performance was aggregated across voxels in each anatomical region and functional network. Story-based and movie-based converter performance were significantly higher than expected by chance in each anatomical region and functional network (q(FDR) < 0.05, one-sided t-test across n = 3 reference participants), indicating the potential for using stimuli of either modality to align semantic representations across participants (see Figure S4 for a control analysis). Story-based converters outperformed movie-based converters the most in functional networks that encode people (* indicates q(FDR) < 0.05, one-sided paired t-test across n = 3 reference participants) and social concepts, and the least in functional networks that encode place and concrete concepts (see Figure S3C for a voxel-level analysis). Error bars indicate the standard error of the mean (n = 3 participants).

However, there are also cortical regions where converter performance differed (Figure S3C). These regions indicate potential avenues for improving cross-modality decoding. The main difference was in auditory cortex, where story-based converters typically performed the best while movie-based converters typically performed the worst. This is consistent with the fact that story-based converters align voxels with similar semantic representations or lower-level speech representations, whereas movie-based converters align voxels with similar semantic representations or lower-level visual representations. As auditory cortex encodes lower-level speech representations, we expect story-based converters to be more accurate37. To improve functional alignment for people with impaired language comprehension, the movie stimuli could be supplemented with non-linguistic auditory stimuli in order to more accurately align auditory cortex and other lower-level speech regions. In theory, story-based converter performance should provide a ceiling for movie-based converter performance, as the converters were evaluated on brain responses to stories. However, movie-based converters outperformed story-based converters in parts of intraparietal sulcus, precuneus, and angular gyrus. This effect was consistent across participants, indicating that it may result from differences in information density or semantic coverage between the story and movie datasets.

To quantify how converters perform across cortex, we aggregated performance within different brain regions in the reference participants (Figure 3C). We first defined four broad anatomical regions—the anterior and posterior portions of the left and right hemispheres. Story-based converters outperformed movie-based converters by a similar amount in each anatomical region. We then defined five functional networks by clustering brain responses to a separate set of story stimuli (see STAR Methods). The clusters appear to separate voxels that represent concrete, social, place, temporal, and people concepts38. Story-based converters outperformed movie-based converters the most in functional networks that encode people (q(FDR) < 0.05, one-sided paired t-test across n = 3 reference participants) and social concepts, and the least in functional networks that encode place and concrete concepts. These results indicate that movie-based converters are worse at aligning certain semantic concepts. Nonetheless, alignment accuracy in each functional network increased with the amount of data from the goal participant, indicating that movie-based converters are not fundamentally incapable of aligning semantic representations in any of the functional networks (Figure 4).

Figure 4. Cross-participant alignment of functional networks.

Story-based converters were trained on subsets of up to 128 minutes of story data, and movie-based converters were trained on subsets of up to 64 minutes of movie data. Converter performance was aggregated across voxels in five functional networks. Converter performance appeared to increase by a constant amount in each network every time the amount of training data was doubled. This indicates that cross-participant converters are not fundamentally incapable of aligning semantic representations in any of the functional networks.

To assess whether the converters actually align fine-grained semantic representations or if they simply align brain regions that have similar response statistics, we also permuted the brain responses for each participant in 10-TR blocks before training cross-participant converters. We found that converters trained on paired brain responses outperformed converters trained on permuted brain responses, indicating that converters operate on fine-grained semantic representations (Figure S4).
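
The block permutation used in this control can be sketched as follows. The sketch shuffles whole 10-TR blocks of a (time x voxels) response matrix so that short-range temporal structure is kept while the pairing with the other participant's responses is broken. How leftover TRs that do not fill a block were handled is not stated, so leaving them in place is an assumption, and the function name `permute_blocks` is hypothetical.

```python
import numpy as np

def permute_blocks(responses, block_len=10, seed=0):
    """Shuffle the order of contiguous blocks of TRs in a (time x voxels) array."""
    rng = np.random.default_rng(seed)
    n_blocks = responses.shape[0] // block_len
    n_used = n_blocks * block_len
    # Split the usable TRs into blocks and reorder the blocks.
    blocks = responses[:n_used].reshape(n_blocks, block_len, -1)
    shuffled = blocks[rng.permutation(n_blocks)].reshape(n_used, -1)
    # Assumption: trailing TRs that do not fill a block stay where they are.
    return np.concatenate([shuffled, responses[n_used:]], axis=0)

# Toy responses: 35 TRs x 2 voxels, so three full blocks plus 5 leftover TRs.
responses = np.arange(70).reshape(35, 2)
permuted = permute_blocks(responses, block_len=10, seed=0)
```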

Lesion simulations

Language disorders are typically caused by brain lesions, so cross-participant decoders should be robust to lesions1,9,12. To test this, we excluded voxels from the goal participant prior to training the converters. We first measured how excluding each brain region affects functional alignment accuracy, and then how it affects cross-participant decoding performance. Excluding a brain region provides a first-order approximation of a lesion to the region—it indicates whether data from the region is required for functional alignment and cross-participant decoding, but does not account for the potential downstream effects of damage to the region9,39.
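
A simulated lesion of this kind amounts to dropping a region's voxels from the goal participant before the converter is fit, then comparing performance with and without the region. The sketch below shows only that bookkeeping on toy values. The region labels, function names, and performance numbers are hypothetical, and the percent-decrease formula is one straightforward reading of the measure.

```python
import numpy as np

def exclude_region(goal, region_labels, region):
    """Drop goal participant voxels (columns) that belong to the given region."""
    keep = np.asarray(region_labels) != region
    return goal[:, keep]

def percent_decrease(full_perf, lesioned_perf):
    """Percent drop in converter performance after excluding a region."""
    return 100.0 * (full_perf - lesioned_perf) / full_perf

# Toy goal responses: 3 timepoints x 4 voxels, one voxel per hypothetical region.
goal = np.arange(12, dtype=float).reshape(3, 4)
labels = np.array(["left_anterior", "left_posterior",
                   "right_anterior", "right_posterior"])

# The converter would then be trained on `lesioned` instead of `goal`.
lesioned = exclude_region(goal, labels, "left_posterior")

# Made-up performance values before and after exclusion.
drop = percent_decrease(0.50, 0.45)
```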

Since anterior and posterior lesions lead to different patterns of language disorders, we separately excluded the anterior and posterior portions of the left and right hemispheres from each goal participant1,9,12,40. We then measured how the simulated lesions affect converter performance in each region of each reference participant (Figure 5A, Figure S5). For both story-based and movie-based converters, excluding posterior regions from the goal participant lowered converter performance more than excluding anterior regions (q(FDR) < 0.05, two-sided paired t-test across n = 24 regions from n = 3 reference participants). For story-based converters, excluding left hemisphere regions lowered converter performance more than excluding right hemisphere regions (q(FDR) < 0.05, two-sided paired t-test across n = 24 regions from n = 3 reference participants). However, overall alignment accuracy remained high for both story-based and movie-based converters, indicating that cross-participant alignment does not depend on data from any single anatomical region.

Figure 5. Lesion simulations.

(A) To assess how different anatomical regions contribute to functional alignment, each region was excluded from the goal participant prior to converter training. Converter performance was aggregated within each region in the reference participant, and the contribution of the excluded region was quantified by the percent decrease in converter performance (see Figure S5 for a voxel-level analysis). For both story-based and movie-based converters, excluding posterior regions led to larger decreases in converter performance than excluding anterior regions. For story-based converters, excluding left hemisphere regions led to larger decreases in converter performance than excluding right hemisphere regions. (B) To assess how different functional networks contribute to functional alignment, each network was excluded from the goal participant prior to converter training. Converter performance was aggregated within each network in the reference participant, and the contribution of the excluded network was quantified by the percent decrease in converter performance (see Figure S6 for a voxel-level analysis). For both story-based and movie-based converters, excluding each network from the goal participant typically led to the largest decrease in converter performance in the corresponding network in the reference participant. (C) To assess how different anatomical regions and functional networks contribute to cross-participant decoding, each region or network was excluded from the goal participant prior to converter training. Cross-participant decoding performance was not significantly lowered when any region or network was excluded from the story-based converter (one-sided paired t-test). (D) Cross-participant decoding performance was not significantly lowered when any region or network was excluded from the movie-based converter (one-sided paired t-test).

The previous analysis captures the best-case outcome, where damage to a region does not affect representations outside of the region. In the worst-case outcome, damage to a region could disrupt the entire functional network that the region belongs to. To simulate this outcome, we separately excluded the concrete, social, place, temporal, and people semantic clusters defined in Figure 3C from each goal participant. We then measured how the simulated disruptions affect converter performance in each network of each reference participant (Figure 5B, Figure S6). For both story-based and movie-based converters, excluding a functional network from the goal participant typically lowered converter performance the most in the corresponding functional network in the reference participant. These effects were more pronounced for movie-based converters, which is consistent with the fact that movie-based converter performance is entirely driven by the alignment of semantic representations, whereas story-based converter performance is driven by the alignment of both semantic representations and lower-level speech representations. However, overall alignment accuracy remained high for both story-based and movie-based converters, indicating that cross-participant alignment does not depend on data from any single functional network.

Finally, we evaluated how excluding each anatomical region or functional network affects cross-participant decoding performance. We found that excluding each anatomical region or functional network led to only a small decrease in decoding performance for both story-based and movie-based converters (Figure 5C, Figure 5D). None of the decreases in decoding performance were statistically significant (one-sided paired t-test across n = 3 goal participants). These results support previous findings that semantic representations are redundantly encoded across cortex in neurologically healthy participants10,11,41. These results are also consistent with previous findings that cortical lesions often spare semantic knowledge27,42-45.

Together, these results characterize how different brain regions contribute to functional alignment and cross-participant decoding in neurologically healthy participants. We found that cross-participant decoders do not depend on data from any single brain region, which could make them robust to lesions. However, our simulations may not fully capture the effects of lesions, so it is important to evaluate the feasibility of cross-participant decoding in participants with actual lesions and language disorders.

Discussion

Our study demonstrates that semantic representations can be functionally aligned across participants and stimulus modalities, and that this alignment can be used to transfer semantic decoders. Most existing decoders target perceptual representations that are activated by a single stimulus modality, such as vision17,23 or speech21. As a result, functional alignment and decoder training have previously only been performed using stimuli of the same modality. By contrast, semantic representations can be activated by either linguistic or non-linguistic stimuli. This enables semantic representations to be aligned using responses to either stories or movies.

Our results have practical implications for people with aphasia, who often have impaired language comprehension in addition to impaired language production. Since non-linguistic semantic processing is preserved in many people with aphasia27,42-45, it may be possible to train semantic decoders on data from neurologically healthy reference participants, and use movie-based converters to transfer the decoders to a goal participant with aphasia. The decoders may then be used to generate natural language predictions of the concepts that the goal participant is thinking about. In this study, we evaluated decoders on brain responses to perceived speech in order to compare decoder predictions to known stimulus words, but we previously found that semantic decoders can also decode responses to imagined speech11. An important direction for future work is assessing whether semantic decoding can reconstruct perceived and imagined speech from participants with aphasia and other language disorders.

The transfer of decoders across participants and stimulus modalities also has important implications for mental privacy46. We previously found that a person’s cooperation is required both to train and apply semantic decoders11. Since functional alignment requires training data collected with the goal participant’s cooperation, we believe that this conclusion still holds. However, cross-participant transfer can reduce the amount of training data required from the goal participant, and cross-modality transfer can obscure the relationship between the training task and the testing task. For instance, a participant may consent to having their brain recorded while they watch silent videos, without realizing that the brain responses could be used for language decoding. We believe it is important to continually assess whether decoders can be transferred across different tasks, in order to understand the implications of collecting different types of brain data47. Ultimately, we believe that brain decoding models—especially those that operate across participants or stimulus modalities—necessitate the creation of new rights or the reinterpretation of existing rights to protect mental privacy48.

STAR Methods

Experimental model and study participant details

Participants

Data were collected from seven participants: S1 (male, age 38 years at time of most recent scan), S2 (male, age 25 years), S3 (male, age 22 years), S4 (female, age 22 years), S5 (female, age 23 years), S6 (female, age 23 years), and S7 (male, age 25 years). Data from S1, S2 and S3 were used for the main decoding analyses. Data from S4, S5, S6, and S7 were used to define functional networks. Data from all participants were used in an extended scaling analysis for story-based converters (Table S1). No statistical methods were used to pre-determine sample sizes but our sample sizes were similar to those reported in previous decoding studies11,17,23,49. No blinding was performed as there were no experimental groups. All participants were healthy and had normal hearing, and normal or corrected-to-normal vision. The experimental protocol was approved by the Institutional Review Board at the University of Texas at Austin. Written informed consent was obtained from all participants. Participants were compensated at a rate of $25 per hour. No data were excluded from analysis.

Method details

MRI data collection

MRI data were collected on a 3T Siemens Skyra scanner, a 3T Siemens Vida scanner, and a 3T Siemens Prisma scanner at the UT Austin Biomedical Imaging Center using a 64-channel Siemens volume coil. To stabilize head motion, participants wore a personalized head case that precisely fit the shape of each participant’s head.

Functional data were collected using gradient echo EPI with repetition time (TR) = 2.00 s, echo time (TE) = 30.8 ms, flip angle = 71°, multi-band factor (simultaneous multi-slice) = 2, voxel size = 2.6 mm x 2.6 mm x 2.6 mm (slice thickness = 2.6 mm), matrix size = (84, 84), and field of view = 220 mm. Anatomical data for all participants except S1 were collected using a T1-weighted multi-echo MP-RAGE sequence on the same 3T Siemens Skyra scanner with voxel size = 1 mm x 1 mm x 1 mm following the Freesurfer morphometry protocol. Anatomical data for participant S1 were collected on a 3T Siemens TIM Trio scanner at the UC Berkeley Brain Imaging Center with a 32-channel Siemens volume coil using the same sequence.

Experimental tasks

The story dataset consisted of 53 stories, each 5-17 min long, taken from The Moth Radio Hour and Modern Love11,34. In each story, a single speaker tells an autobiographical narrative. Each story was played during a separate fMRI scan with a buffer of 10 s of silence before and after the story. These data were collected during 10 scanning sessions, with each session consisting of 5 or 6 stories. Stories were played over Sensimetrics S14 in-ear piezoelectric headphones. The audio for each stimulus was converted to mono and filtered to correct for frequency response and phase errors induced by the headphones using calibration data provided by Sensimetrics and custom Python code (https://github.com/alexhuth/sensimetrics_filter). All stimuli were played at 44.1 kHz using the pygame library in Python. Each story was manually transcribed by one listener. Certain sounds (for example, laughter and breathing) were also marked to improve the accuracy of the automated alignment. The audio of each story was then downsampled to 11 kHz and the Penn Phonetics Lab Forced Aligner (P2FA)50 was used to automatically align the audio to the transcript. After automatic alignment was complete, Praat51 was used to check and correct each aligned transcript manually.

The movie dataset consisted of 13 movie clips, each 3-7 min long, from Pixar Animation Studios and The Blender Foundation. The movie clips were self-contained and almost entirely devoid of language. The original high-definition movie clips were cropped and downsampled to 727 x 409 pixels. Each movie clip was presented without sound during a single fMRI scan, with a 10 s black screen buffer before and after the movie clip.

The test dataset consisted of 5 repeats of the 10 min story “Where There’s Smoke” by Jenifer Hixson from The Moth Radio Hour. The test story was held out from model training. Each repeat of the test story was played during a single fMRI scan with a buffer of 10 s of silence before and after the story.

fMRI data preprocessing

Each functional run was motion-corrected using the FMRIB Linear Image Registration Tool (FLIRT) from FSL 5.014. All volumes in the run were then averaged to obtain a high quality template volume. FLIRT was then used to align the template volume for each run to the overall template, which was chosen to be the template for the first functional run for each participant. These automatic alignments were manually checked.

Low-frequency voxel response drift was subtracted from the signal. The mean response for each voxel was then subtracted and the remaining response was scaled to have unit variance. For responses to stories, low-frequency voxel response drift was identified using a 2nd order Savitzky-Golay filter with a 120 second window. For responses to movies, low-frequency voxel response drift was identified using a Legendre polynomial of degree 3, since the movie scans were shorter than the story scans52.
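The movie-scan detrending step can be illustrated with a minimal sketch, assuming responses are stored as a (timepoints x voxels) NumPy array. The function name `detrend_and_zscore` and the simulated data are hypothetical and are not taken from the released code; the drift is fit by least squares on a degree-3 Legendre basis, which is one straightforward way to realize the description above.

```python
import numpy as np
from numpy.polynomial import legendre

def detrend_and_zscore(responses, degree=3):
    """Remove a Legendre-polynomial drift from each voxel, then z-score.

    responses: array of shape (timepoints, voxels).
    """
    n_time = responses.shape[0]
    x = np.linspace(-1.0, 1.0, n_time)           # Legendre polynomials are defined on [-1, 1]
    basis = legendre.legvander(x, degree)        # (timepoints, degree + 1)
    coef, *_ = np.linalg.lstsq(basis, responses, rcond=None)
    residual = responses - basis @ coef          # drift-subtracted signal
    residual = residual - residual.mean(axis=0)  # zero mean per voxel
    return residual / residual.std(axis=0)       # unit variance per voxel

# Simulated example (assumption): white noise plus a slow cubic drift.
rng = np.random.default_rng(0)
t = np.linspace(-1.0, 1.0, 150)
drift = 5.0 * t + 3.0 * t ** 3
signal = rng.standard_normal((150, 4)) + drift[:, None]
clean = detrend_and_zscore(signal)
```

The story scans would instead use a Savitzky-Golay estimate of the drift; only the drift estimator changes, while the subtraction and scaling steps stay the same.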

Cortical surface visualization

Cortical surface meshes were generated from the T1-weighted anatomical scans using Freesurfer13. Before surface reconstruction, anatomical surface segmentations were hand-checked and corrected. Blender was used to remove the corpus callosum and make relaxation cuts for flattening. Functional images were aligned to the cortical surface using boundary based registration (BBR) implemented in FSL. These alignments were manually checked for accuracy and adjustments were made as necessary.

Flatmaps were created by projecting the values for each voxel onto the cortical surface using the “nearest” scheme in pycortex53. This projection finds the location of each pixel in the flatmap in 3D space and assigns that pixel the associated value.

Bayesian decoding

The goal of language decoding is to maximize the probability distribution $P(S \mid R)$ over word sequences $S$ given brain responses $R$. In Bayesian decoding, $P(S \mid R)$ is factorized into a prior distribution $P(S)$ over word sequences and an encoding distribution $P(R \mid S)$ over brain responses given word sequences. The prior distribution $P(S)$ can be estimated using a language model, and the encoding distribution $P(R \mid S)$ can be estimated using an encoding model11.
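The factorization follows from Bayes' rule. As a brief sketch in the notation of this section, the normalizing term P(R) does not depend on the word sequence, so it can be dropped when comparing candidate sequences:

```latex
P(S \mid R) = \frac{P(R \mid S)\, P(S)}{P(R)} \;\propto\; P(R \mid S)\, P(S)
```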

Language model

Generative Pre-trained Transformer (GPT, also known as GPT-1) is a 12-layer transformer neural network trained to predict the probability distribution over the next word $s_n$ in a sequence $(s_1, s_2, \ldots, s_{n-1})$54. GPT was used to estimate the prior distribution $P(S)$. Given a word sequence $S = (s_1, s_2, \ldots, s_n)$, GPT estimates the probability of observing $S$ in natural language by multiplying the probabilities of each word conditioned on the previous words: $P(S) = \prod_{i=1}^{n} P(s_i \mid s_{1:i-1})$, where $s_{1:0}$ is the empty sequence ∅.
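The chain-rule product can be sketched in a few lines. This is an illustration only: `uniform_model` is a hypothetical stand-in for the language model, and the sum is taken in log space because products of many small probabilities underflow.

```python
import math

def sequence_log_prob(words, cond_prob):
    """Return log P(S) as the sum of log P(s_i | s_1..s_{i-1})."""
    total = 0.0
    for i, word in enumerate(words):
        context = tuple(words[:i])          # empty tuple for the first word
        total += math.log(cond_prob(word, context))
    return total

# Hypothetical toy model: uniform over a 4-word vocabulary, ignoring context.
def uniform_model(word, context):
    return 0.25

log_p = sequence_log_prob(["i", "saw", "a", "dog"], uniform_model)
```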

GPT was also used to extract features from language stimuli. In order to perform the next word prediction task, GPT extracts quantitative features that capture the meaning of input sequences. Given a word sequence $S = (s_1, s_2, \ldots, s_n)$, the GPT hidden layer activations provide vector embeddings that represent the meaning of the most recent word $s_n$ in context.

Encoding model

Encoding models predict brain responses from stimulus words41. To train an encoding model for a participant, quantitative features are extracted from the stimulus words, and regularized linear regression is used to estimate a set of weights that predict how each feature affects the BOLD signal in each voxel.

For encoding models trained on the story dataset, quantitative features were extracted from transcripts of the stories. For encoding models trained on the movie dataset (Figure 2E), quantitative features were extracted from transcripts of official audio descriptions from Pixar Animation Studios. For each word-time pair $(s_i, t_i)$, the word sequence $(s_{i-5}, s_{i-4}, \ldots, s_{i-1}, s_i)$ was provided to the GPT language model, and features of $s_i$ were extracted from the ninth layer of the model38,55-57. This yields a new list of vector-time pairs $(M_i, t_i)$ where $M_i$ is a 768-dimensional embedding for $s_i$. These vectors were then resampled at times corresponding to the fMRI acquisitions using a three-lobe Lanczos filter41.

A linearized finite impulse response (FIR) model was fit to every cortical voxel in each participant’s brain41. A separate linear temporal filter with four delays ($t-1$, $t-2$, $t-3$, and $t-4$ timepoints) was fit for each of the 768 features, yielding a total of 3,072 features. With a TR of 2 s this was accomplished by concatenating the feature vectors from 2, 4, 6, and 8 s earlier to predict responses at time $t$. Before doing regression, each feature channel of the training matrix was z-scored.
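The delay concatenation can be sketched as follows, assuming the resampled features are stored as a (timepoints x features) array. The helper name `make_delayed` and the zero-padding of the first rows are assumptions made for illustration.

```python
import numpy as np

def make_delayed(features, delays):
    """Stack time-shifted copies of a (timepoints x features) matrix column-wise.

    Row t of the block for delay d holds the features from time t - d;
    rows with no earlier sample are left as zeros.
    """
    n_time, n_feat = features.shape
    blocks = []
    for d in delays:
        shifted = np.zeros((n_time, n_feat))
        if d < n_time:
            shifted[d:] = features[: n_time - d]
        blocks.append(shifted)
    return np.hstack(blocks)

# Random stand-in for 768-dimensional embeddings at 100 acquisitions.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 768))
X_delayed = make_delayed(X, delays=[1, 2, 3, 4])   # 4 delays x 768 = 3,072 columns
```

With a TR of 2 s, delays of 1 to 4 samples correspond to the 2, 4, 6, and 8 s lags described above.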

The 3,072 weights for each voxel were estimated using L2-regularized linear regression41. The regression procedure has a single free parameter that controls the degree of regularization. This regularization coefficient was found for each voxel by repeating a regression and cross-validation procedure 50 times. In each iteration, approximately a fifth of the timepoints were removed from the model training dataset and reserved for validation. Then, the model weights were estimated on the remaining timepoints for each of 10 possible regularization coefficients (log spaced between 10 and 1,000). These weights were used to predict responses for the reserved timepoints, and then $R^2$ was computed between actual and predicted responses. For each voxel, the regularization coefficient was chosen as the value that led to the best performance, averaged across bootstraps, on the reserved timepoints. The 10,000 cortical voxels with the highest cross-validation performance were used for decoding.
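A minimal sketch of this per-voxel selection procedure is given below. It is not the authors' implementation: the function names are hypothetical, the held-out timepoints are drawn individually at random (the text does not say how they were grouped), and the example uses small simulated data with fewer repeats to keep it fast.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form L2-regularized least squares: W = (X'X + alpha I)^-1 X'Y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ Y)

def r_squared(Y_true, Y_pred):
    """Coefficient of determination for each column (voxel)."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

def select_alphas(X, Y, alphas, n_repeats=50, holdout_frac=0.2, seed=0):
    """Pick, for each voxel, the alpha with the best mean held-out R^2."""
    rng = np.random.default_rng(seed)
    n_time = X.shape[0]
    n_hold = int(round(holdout_frac * n_time))
    scores = np.zeros((n_repeats, len(alphas), Y.shape[1]))
    for rep in range(n_repeats):
        perm = rng.permutation(n_time)
        held, kept = perm[:n_hold], perm[n_hold:]
        for a, alpha in enumerate(alphas):
            W = ridge_fit(X[kept], Y[kept], alpha)
            scores[rep, a] = r_squared(Y[held], X[held] @ W)
    mean_scores = scores.mean(axis=0)      # (alphas, voxels)
    best = mean_scores.argmax(axis=0)      # index of best alpha per voxel
    return np.asarray(alphas)[best], mean_scores

alphas = np.logspace(1, 3, 10)             # 10 values, log spaced from 10 to 1,000
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 20))
Y = X @ rng.standard_normal((20, 5)) + rng.standard_normal((200, 5))
best_alphas, mean_scores = select_alphas(X, Y, alphas, n_repeats=5)
```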

The encoding model estimates a function $\hat{R}$ that maps from stimulus features $S$ to predicted brain responses $\hat{R}(S)$. Assuming that BOLD signals are affected by Gaussian additive noise, the likelihood of observing brain responses $R$ given stimulus features $S$ can be modeled as a multivariate Gaussian distribution $P(R \mid S)$ with mean $\mu = \hat{R}(S)$ and covariance $\Sigma = (R - \hat{R}(S))^T (R - \hat{R}(S))$49. A bootstrap procedure was used to estimate $\Sigma$. Each story or movie scan was held out from the model training dataset, and an encoding model was estimated using the remaining data. A bootstrap noise covariance matrix for the held-out scan was computed using the residuals between the predicted responses and the actual responses. The bootstrap noise matrices were averaged across held-out scans to obtain $\Sigma$.
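The Gaussian likelihood can be sketched as follows, assuming responses are (timepoints x voxels) arrays. The function names, the shrinkage toward a scaled identity, and the simulated data are assumptions for illustration; terms that do not depend on the candidate word sequence (the normalizing constant and the log-determinant) are dropped, since the covariance is fixed during decoding.

```python
import numpy as np

def noise_covariance(residuals, shrinkage=0.1):
    """Estimate the noise covariance from (timepoints x voxels) residuals.

    The shrinkage toward a scaled identity is an assumed regularizer that
    keeps the matrix invertible.
    """
    sigma = residuals.T @ residuals
    n_vox = sigma.shape[0]
    target = np.eye(n_vox) * np.trace(sigma) / n_vox
    return (1.0 - shrinkage) * sigma + shrinkage * target

def gaussian_log_likelihood(R, R_pred, sigma):
    """Sum over timepoints of the Gaussian log density, dropping constant terms."""
    diff = R - R_pred
    solved = np.linalg.solve(sigma, diff.T)   # Sigma^-1 (R - R_pred)^T
    return -0.5 * float(np.sum(diff.T * solved))

rng = np.random.default_rng(0)
sigma = noise_covariance(rng.standard_normal((50, 8)))
R = rng.standard_normal((10, 8))
noise = rng.standard_normal((10, 8))
ll_near = gaussian_log_likelihood(R, R + 0.1 * noise, sigma)  # prediction close to the data
ll_far = gaussian_log_likelihood(R, R + 2.0 * noise, sigma)   # prediction far from the data
```

A candidate whose predicted responses lie closer to the recorded responses receives a higher log-likelihood, which is what the decoder uses to rank continuations.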

All model fitting and analysis was performed using custom software written in Python, making heavy use of NumPy58, SciPy59, PyTorch60, Transformers61, and pycortex53.

Word rate model

A word rate model was estimated for each participant to predict when words are processed. The word rate at each fMRI acquisition was defined as the number of stimulus words that occurred since the previous acquisition. A 33-dimensional vector of response features for each fMRI acquisition was obtained by averaging brain responses within 33 Freesurfer ROIs. Regularized linear regression was used to estimate a set of weights that predict the word rate from the response features.

A separate linear temporal filter with 4 delays (t+1, t+2, t+3, and t+4) was fit for each voxel. With a TR of 2 s this was accomplished by concatenating the response features from 2, 4, 6 and 8 s later to predict the word rate at time t. Given novel brain responses, this model predicts the word rate at each acquisition. The time between consecutive acquisitions (2 s) is then evenly divided by the predicted word rates (rounded to the nearest nonnegative integers) to predict word times. Since the response features are computed using ROIs that are shared across participants, a word rate model trained on reference participant responses can be applied to goal participant responses.
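The conversion from predicted word rates to word times can be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical, acquisition k is taken to cover the interval starting at k x TR, and words are placed at evenly spaced offsets from the start of that interval.

```python
def predict_word_times(word_rates, tr=2.0):
    """Spread round(rate) words evenly across each acquisition interval."""
    times = []
    for k, rate in enumerate(word_rates):
        n_words = max(0, int(round(rate)))   # nearest nonnegative integer
        start = k * tr
        for j in range(n_words):
            times.append(start + j * tr / n_words)
    return times

# Hypothetical predicted word rates for three consecutive acquisitions.
times = predict_word_times([2.2, 0.4, 3.0])
```

In this example the first interval receives two words, the second none, and the third three.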

Beam search

Given new brain responses $R_{\text{test}}$, the language model and encoding model can be used to evaluate $P(S \mid R_{\text{test}})$ for any word sequence $S$. However, the combinatorial structure of natural language makes it computationally infeasible to evaluate all possible word sequences. As a result, the beam search decoding approach introduced in ref.11 was used to iteratively construct word sequences with high values of $P(S)$ and $P(R_{\text{test}} \mid S)$.

The decoder maintains a beam containing the $k$ most likely word sequences. The beam is initialized with an empty word sequence. When new words are detected by the word rate model, the language model generates continuations for each candidate $S$ in the beam. The language model uses the last 8 s of predicted words $(s_{n-i}, \ldots, s_{n-1})$ in the candidate to predict the distribution $P(s_n \mid s_{n-i}, \ldots, s_{n-1})$ over the next word. Each word in the language model nucleus62 is appended to the candidate to form a continuation $C$.

The encoding model scores each continuation by the likelihood $P(R_{\text{test}} \mid C)$ of observing the recorded brain responses. The $k$ most likely continuations across all candidates are retained in the beam. After iterating through all of the predicted word times, the decoder outputs the candidate sequence with the highest likelihood.
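The beam search loop can be illustrated with a toy sketch. Everything here is a stand-in: `toy_propose` plays the role of the language model nucleus and `toy_score` plays the role of the encoding model likelihood, and neither reflects the actual models used in the study.

```python
def beam_search(n_steps, propose, score, beam_width):
    """Keep the beam_width highest-scoring sequences after each extension step."""
    beam = [()]                                   # start from the empty word sequence
    for _ in range(n_steps):
        continuations = [cand + (w,) for cand in beam for w in propose(cand)]
        continuations.sort(key=score, reverse=True)
        beam = continuations[:beam_width]
    return beam[0]

# Hypothetical target used only to define a toy scoring function.
TARGET = ("she", "opened", "the", "door")

def toy_propose(candidate):
    """Stand-in for the language model nucleus: a fixed set of next words."""
    return ["she", "he", "opened", "closed", "the", "door"]

def toy_score(candidate):
    """Stand-in for the encoding model likelihood: matches with the target."""
    return sum(1 for a, b in zip(candidate, TARGET) if a == b)

best = beam_search(len(TARGET), toy_propose, toy_score, beam_width=3)
```

In the actual decoder, one extension step would be triggered at each predicted word time rather than for a fixed number of steps.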

Decoder parameters

The decoding procedure has several parameters that affect model performance. The beam search algorithm is parameterized by the beam width $k$. The encoding model is parameterized by the number of context words provided when extracting GPT embeddings. The noise model is parameterized by a shrinkage factor $a$ that regularizes the covariance $\Sigma$. The language model is parameterized by the length of the input context, the nucleus mass $p$ and ratio $r$, and the set of possible output words.

The decoder parameters were tuned in ref.11 by decoding responses to a calibration story separate from the training and test stories, and then validating and adjusting the best-performing parameter values through qualitative analysis of decoder predictions.

Functional alignment

To perform functional alignment between a goal participant $g$ and a reference participant $r$, brain responses $R_g$ and $R_r$ were recorded while the participants were shown a shared set of stimuli. A converter $C_{g \to r}$ was trained to predict the activity in each reference participant voxel using the activity in a population of $v$ goal participant voxels. Since the reference participant decoders target semantic representations, the $v$ goal participant voxels should encode semantic representations. Semantically selective voxels are typically identified using encoding model performance or inter-trial correlation34. However, these approaches require linguistic training data from the goal participant, making them unsuitable for participants with impaired language comprehension. In this study, semantically selective voxels were instead identified using functional or anatomical alignment.

For the cross-participant decoding analyses (Figure 2, Figure 5C, Figure 5D), goal participant voxels were identified using functional alignment. In this approach, a reverse converter $C_{r \to g}$ was trained to take reference participant responses and predict goal participant responses. Semantically selective voxels were identified in the reference participant using encoding model cross-validation performance11, and linear regression was used to estimate the reverse converter. Then, the reverse converter was applied to brain responses while the reference participant listened to held-out stories, producing an aligned response matrix $R_{r \to g}$. This process was performed for both reference participants, and linear correlation was computed between the two aligned response matrices. This process identifies goal participant voxels with semantic representations that are shared across participants. The $v$ = 15,000 goal participant voxels with the highest correlations were selected. These goal participant voxels are shown in Figure S7A for story-based converters and Figure S7B for movie-based converters.

While precise, this functional alignment approach identifies different goal participant voxels for story-based and movie-based converters. In analyses that directly compare story-based and movie-based converter performance (Figure 3, Figure 5A, Figure 5B), the converters should be trained on the same goal participant voxels. For these analyses, goal participant voxels were identified using anatomical alignment. In this approach, surface-based alignment63 was used to align encoding model cross-validation performance from the reference participants with the brain of the goal participant. The $v$ = 15,000 goal participant voxels with the highest aligned performance were selected. These goal participant voxels are shown in Figure S7C. While the anatomical alignment approach was less precise than the functional alignment approach, it nonetheless identified many semantically selective regions.

After the goal participant voxels were identified, L2-regularized linear regression was used to estimate the converter $C_{g \to r}$. For each reference participant voxel, a set of $v$ weights were estimated to predict the activity in that voxel based on the activity in the $v$ goal participant voxels. The regression procedure has a single free parameter that controls the degree of regularization. This regularization coefficient was found for each voxel using the same cross-validation procedure used for encoding model estimation. Given new goal participant responses, the converter predicts reference participant responses $R_{g \to r} = C_{g \to r} R_g$.
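Converter estimation can be sketched with simulated data, under the assumption that both participants' responses are driven by a shared latent signal. The function names, the fixed regularization coefficient, and the simulation are hypothetical; responses are stored as (timepoints x voxels) arrays, so the converter is applied by right-multiplication.

```python
import numpy as np

def fit_converter(R_goal, R_ref, alpha):
    """Ridge regression from goal voxels (time x v) to reference voxels (time x n_ref)."""
    v = R_goal.shape[1]
    return np.linalg.solve(R_goal.T @ R_goal + alpha * np.eye(v), R_goal.T @ R_ref)

def apply_converter(C, R_goal_new):
    """Predict reference-space responses for new goal participant responses."""
    return R_goal_new @ C

def column_correlation(A, B):
    """Linear correlation between matching columns of A and B."""
    A = (A - A.mean(axis=0)) / A.std(axis=0)
    B = (B - B.mean(axis=0)) / B.std(axis=0)
    return (A * B).mean(axis=0)

# Simulated shared stimulus-driven signal observed through two different "brains".
rng = np.random.default_rng(0)
latent = rng.standard_normal((300, 5))
R_goal = latent @ rng.standard_normal((5, 30)) + 0.1 * rng.standard_normal((300, 30))
R_ref = latent @ rng.standard_normal((5, 12)) + 0.1 * rng.standard_normal((300, 12))

C = fit_converter(R_goal[:200], R_ref[:200], alpha=1.0)   # train on shared stimuli
R_aligned = apply_converter(C, R_goal[200:])              # convert held-out responses
corr = column_correlation(R_aligned, R_ref[200:])
```

In practice the regularization coefficient would be selected per voxel by cross-validation rather than fixed as it is here.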

Cross-participant decoding

Cross-participant decoding was evaluated on brain responses to the test story. To predict word times, the word rate models for the reference participants were applied to the goal participant responses. The time between consecutive acquisitions (2 s) was evenly divided by the mean of the predicted word rates across the reference word rate models (rounded to the nearest nonnegative integers).

To predict word identities, the cross-participant converters were used to align the goal participant responses with the brains of the reference participants. Following the beam search decoding approach, word sequences $C$ were generated using the language model. To obtain decoder predictions using a single reference participant $r$, the likelihood $P(R_g \mid C)$ was estimated by computing $P(R_{g \to r} \mid C)$ using the encoding model trained on the reference participant. Previous studies have found that ensembling predictions from multiple reference participants can improve decoding performance17,18,21. To ensemble predictions from multiple reference participants, the likelihood $P(R_g \mid C)$ was estimated by computing $P(R_{g \to r} \mid C)$ for each reference participant $r$, and then multiplying the likelihoods across the reference participants.
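Multiplying likelihoods across reference participants is equivalent to summing log-likelihoods, which is numerically safer. The sketch below uses made-up log-likelihood values and hypothetical names purely to show the mechanics.

```python
def ensemble_log_likelihood(log_likelihoods):
    """Multiplying likelihoods across references equals summing their logs."""
    return sum(log_likelihoods)

# Hypothetical log-likelihoods of two continuations under two reference participants.
scores = {
    "continuation_a": [-10.0, -12.0],
    "continuation_b": [-9.0, -15.0],
}
ensembled = {name: ensemble_log_likelihood(vals) for name, vals in scores.items()}
best = max(ensembled, key=ensembled.get)
```

Here the second continuation is preferred by the first reference participant alone, but the first continuation wins once both references are combined.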

Datasets

Participants S1, S2, and S3 listened to 10 hours of stories and watched 70 minutes of movies. Participants S4 and S5 listened to 9 hours of stories. Participants S6 and S7 listened to 5 hours of stories.

For the analysis in Table S1, participants S6 and S7 were treated as goal participants, and participants S1, S2, S3, S4, and S5 were treated as reference participants. Reference participant decoders were trained on 9 hours of story data, and cross-participant converters were trained on the first 70 minutes of story data.

For all other analyses, a bootstrap procedure was performed using participants S1, S2, and S3. Each participant was treated as the goal participant, and the other two participants were treated as reference participants. Reference participant decoders were trained on 10 hours of story data, and cross-participant converters were trained on the first 70 minutes of story data or movie data.

For the analysis in Figure 2D, cross-participant and within-participant decoders were trained on subsets of approximately 8, 16, 32, 64, and 128 minutes of story data. For the analysis in Figure 2E, cross-participant and within-participant decoders were trained on subsets of approximately 8, 16, 32, and 64 minutes of movie data. To sample a subset of approximately $t$ minutes from a full set of $s$ minutes, each stimulus was included with probability $p = t/s$. Story subsets were sampled from 5 hours of stories, and movie subsets were sampled from 70 minutes of movies. This process was performed twice for each target duration. For the cross-participant condition, reference participant decoders were trained on 10 hours of story data, and cross-participant converters were trained on the sampled subset. For the within-participant condition, goal participant decoders were trained on the sampled subset. Because noise model estimation requires brain responses to multiple stimuli, subsets that only contained one story or movie were excluded from the within-participant condition.
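The subset sampling rule can be sketched as follows. The function name, the stimulus names, and the durations are hypothetical; each stimulus is included independently with probability equal to the target duration divided by the total duration, so the sampled duration is only approximately the target.

```python
import random

def sample_subset(stimulus_minutes, target_minutes, seed=None):
    """Include each stimulus independently with probability target / total."""
    rng = random.Random(seed)
    total = sum(stimulus_minutes.values())
    p = target_minutes / total
    return [name for name in stimulus_minutes if rng.random() < p]

# Hypothetical movie clips and durations in minutes.
clips = {"clip_a": 5.0, "clip_b": 7.0, "clip_c": 3.0, "clip_d": 6.0}
subset = sample_subset(clips, target_minutes=8.0, seed=0)
```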

Anatomical regions

Whole brain MRI data were partitioned into 4 anatomical regions using Freesurfer ROIs: anterior left hemisphere, posterior left hemisphere, anterior right hemisphere, and posterior right hemisphere. Anterior regions were defined using the superiorfrontal, rostralmiddlefrontal, caudalmiddlefrontal, parsopercularis, parstriangularis, parsorbitalis, lateralorbitofrontal, medialorbitofrontal, precentral, paracentral, frontalpole, rostralanteriorcingulate, and caudalanteriorcingulate labels. Posterior regions were defined using the superiorparietal, inferiorparietal, supramarginal, postcentral, precuneus, posteriorcingulate, isthmuscingulate, superiortemporal, middletemporal, inferiortemporal, bankssts, fusiform, transversetemporal, entorhinal, temporalpole, parahippocampal, lateraloccipital, lingual, cuneus, and pericalcarine labels.

Functional networks

Whole brain MRI data were partitioned into 5 functional networks using a clustering approach38. Functional networks were defined using brain responses to 4 hours of story stimuli from participants S4, S5, S6, and S711,38. A 985-dimensional lexical embedding space was used to extract semantic features from each stimulus word, and linear regression was used to estimate encoding models that predict brain responses from the stimulus features41. Encoding model weights were averaged across the 4 delays to produce a 985-dimensional vector for each voxel, which captures how that voxel responds to the stimulus features. The top 10,000 voxels were identified in each participant based on cross-validation performance. The weights were concatenated across participants and rescaled to have unit norm. Principal components analysis was applied to the weights, and 64 dimensions that explained 80% of the variance were chosen.

Spherical k-means clustering was applied to the principal components. To determine the number of clusters, the inertia of the clustering algorithm was computed for cluster counts between 1 and 20, and the point where the inertia changes from an exponential drop to a linear drop was identified. This point occurred at 5 clusters. Each of the 5 clusters was interpreted by projecting word embeddings onto the cluster centroid. The clusters were subjectively determined to represent concrete, social, place, temporal, and people concepts.

The functional networks were then defined in participants S1, S2, and S3. Encoding models were estimated using brain responses to 4 hours of story stimuli. The encoding model weights for each voxel were projected onto the 64 principal components, and each voxel was then assigned to one of the 5 clusters. This process assigns a functional network to every voxel.

Quantification and statistical analysis

Decoding performance

The decoder predictions were compared to the stimulus words using the BERTScore metric, which was designed to quantify the similarity of meaning between predicted and actual word segments35. To compute BERTScore, a bidirectional transformer language model is used to represent each word in the predicted and actual segments as a contextual embedding. Each word in the actual segment is scored based on its maximum cosine similarity across all of the words in the predicted segment, and these scores are averaged across the words in the actual segment.
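The scoring rule described above can be sketched with NumPy. This is not the BERTScore library: the embeddings here are random stand-ins for contextual embeddings from a bidirectional transformer, and the function name is hypothetical.

```python
import numpy as np

def recall_score(actual_emb, predicted_emb):
    """Mean over actual words of the max cosine similarity to any predicted word."""
    a = actual_emb / np.linalg.norm(actual_emb, axis=1, keepdims=True)
    p = predicted_emb / np.linalg.norm(predicted_emb, axis=1, keepdims=True)
    sim = a @ p.T                      # (actual words x predicted words)
    return float(sim.max(axis=1).mean())

# Random stand-in embeddings: 6 actual words with 16-dimensional vectors.
rng = np.random.default_rng(0)
actual = rng.standard_normal((6, 16))
same_words_reordered = recall_score(actual, actual[::-1])
unrelated = recall_score(np.eye(3)[:2], np.eye(3)[2:])
```

Because each actual word is matched to its most similar predicted word, the score does not depend on word order within the segment.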

BERTScore was computed between the predicted and actual words within a 20 s segment around every second of the stimulus (window similarity). The scores were averaged across segments to quantify how well the decoder predicted the full stimulus (story similarity).

Null models

BERTScore values are not inherently interpretable, as they are derived from contextual embeddings35. As a result, BERTScore values were normalized with respect to a null distribution.

Null sequences were generated by sampling from the language model without using any brain data except to predict word times11. The null model maintains a beam of 10 candidate sequences and generates continuations from the language model nucleus64 at each predicted word time. The only difference between the actual decoder and the null model is that, instead of ranking the continuations by the likelihood of the fMRI data, the null model randomly assigns a likelihood to each continuation. After iterating through all of the predicted word times, the null model outputs the candidate sequence with the highest likelihood. For each participant, this process was repeated 200 times to generate 200 null sequences. This process is as similar as possible to the actual decoder without using any brain data to select words, so these sequences reflect the null hypothesis that the decoder does not recover meaningful information about the stimulus from the brain data. The null sequences were scored against the transcript of the test story to produce a null distribution of decoding scores for each participant.

In order to ensure that this null model does not provide trivially low estimates, an alternative null model was constructed by permuting the actual voxel responses before performing within-participant decoding. This alternative null model preserves the statistics of the brain responses. For each participant, this process was repeated 10 times to generate 10 alternative null sequences, which were scored against the transcript of the test story to produce an alternative null distribution of decoding scores for each participant. The means of the alternative null distributions (0.7897-0.7909) were slightly higher than the means of the original null distributions (0.7898-0.7899), but still lower than the observed decoding scores (0.7998-0.8016 for story-based converters, 0.7981-0.7996 for movie-based converters). The standard deviations of the alternative null distributions (0.0007-0.0009) were lower than the standard deviations of the original null distributions (0.0012-0.0013). These comparisons indicate that the original null model does not trivially inflate the decoding scores.

Identification performance

Another way to quantify decoding performance is to take segments of the decoder prediction and identify their corresponding locations in the actual story. To do this, a similarity matrix $M$ is constructed where $M_{i,j}$ reflects the BERTScore similarity between the $i$-th segment of the decoder prediction and the $j$-th segment of the actual story. For each timepoint $i$, the actual segments are sorted based on their similarity to the $i$-th predicted segment, and the timepoint is scored based on the percentile rank of the $i$-th actual segment. A high percentile rank for the $i$-th predicted segment indicates that it is possible to identify the location of the segment in the actual story. These percentile ranks are averaged across the predicted segments. The mean percentile rank ranges from 0 to 1, where 0.5 indicates chance level.
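The percentile-rank computation can be sketched as follows. The function name is hypothetical, and the exact percentile convention (here, the fraction of other actual segments that are strictly less similar than the matching one) is an assumption, since tie handling is not specified.

```python
import numpy as np

def mean_percentile_rank(M):
    """Average percentile rank of the matching actual segment for each prediction.

    M[i, j] is the similarity between predicted segment i and actual segment j.
    """
    n = M.shape[0]
    ranks = []
    for i in range(n):
        others = np.delete(M[i], i)                  # similarities to non-matching segments
        ranks.append(np.mean(others < M[i, i]))      # fraction ranked below the match
    return float(np.mean(ranks))

perfect = mean_percentile_rank(np.eye(4))            # match is always the most similar
worst = mean_percentile_rank(1.0 - np.eye(4))        # match is always the least similar
```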

Converter performance

Converter performance was evaluated on brain responses to five repeats of the test story. Brain responses were averaged across repeats to increase the signal-to-noise ratio. Converters were used to predict reference participant responses based on goal participant responses, and linear correlation was computed between the aligned and actual response time-courses in each reference participant voxel. Converter performance was normalized using an inter-trial noise ceiling36, which estimates the linear correlation between the actual reference participant responses and the best possible prediction that could theoretically be obtained.

While normalizing using an inter-trial noise ceiling is a common approach, a caveat is that the noise ceiling does not account for memory effects. This may cause it to underestimate the amount of explainable variance in brain regions where responses systematically change as the participant listens to more repeats of the test story. To ensure that the results are not biased by the inter-trial noise ceiling, the non-normalized correlation values are provided in Figure S2. The normalization should not bias the comparisons between story-based and movie-based converter predictions, since the two sets of predictions were normalized using the same story-based noise ceiling.

To aggregate performance within an anatomical region or functional network, the noise-ceiling corrected responses were averaged across all voxels in the region or network that belong to the 10,000 voxels used for decoding.

Statistical testing

Statistical significance of decoding performance was tested for each participant and then replicated across all participants (n = 3). To test statistical significance of the decoding scores, the observed decoding score was compared to the decoding scores of the null sequences; p values were computed as the fraction of null sequences with a decoding score greater than or equal to the observed decoding score. To evaluate the statistical significance of the identification analysis, a null distribution of similarity matrices was computed by randomly shuffling 10-row blocks of the similarity matrix $M$. This shuffling procedure was performed 2,000 times. The observed mean percentile rank was compared to the mean percentile ranks of the shuffled similarity matrices; p values were computed as the fraction of shuffled similarity matrices with a mean percentile rank greater than or equal to the observed mean percentile rank.
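The p value computation amounts to counting null scores at or above the observed score. A minimal sketch, with a hypothetical function name and made-up scores:

```python
def null_p_value(observed, null_scores):
    """Fraction of null scores greater than or equal to the observed score."""
    count = sum(1 for score in null_scores if score >= observed)
    return count / len(null_scores)

# Made-up decoding scores for illustration only.
p = null_p_value(0.80, [0.79, 0.78, 0.81, 0.79])
```

With 200 null sequences, the smallest nonzero p value that this procedure can return is 1/200 = 0.005.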

All other statistical tests were performed using each participant as an observation (n = 3) unless otherwise stated.

All tests were corrected for multiple comparisons when necessary by controlling the false discovery rate (FDR)65. Data distributions were assumed to be normal, but this was not formally tested due to the small-n study design. The range across participants was reported for all quantitative results.

Supplementary Material

Table S2. Decoder predictions for a perceived story, related to Figure 2. Language decoders were evaluated on single-trial BOLD fMRI responses recorded while a goal participant listened to the test story “Where There’s Smoke” by Jenifer Hixson from The Moth Radio Hour. The actual stimulus words are shown alongside the decoder predictions from a within-participant decoder trained on 10 hours of story data from the goal participant, a cross-participant decoder trained on 70 minutes of story data from the goal participant, and a cross-participant decoder trained on 70 minutes of movie data from the goal participant.

2

Key resources table

| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Deposited data | | |
| Story fMRI data reported in LeBel et al.34 | Huth Lab, Texas, USA | https://doi.org/10.18112/openneuro.ds003020.v3.0.0 |
| Movie fMRI data | This study | https://doi.org/10.18112/openneuro.ds005717.v1.0.0 |
| Software and algorithms | | |
| Code for cross-participant semantic decoding | This study | https://doi.org/10.12751/g-node.fh54ec |

Highlights

  • Semantic decoders can be transferred across participants using functional alignment

  • Semantic representations can be aligned using responses to either stories or movies

  • Cross-participant semantic decoding is robust to simulated lesions

Acknowledgements

We thank S. Wilson for comments on this manuscript and M. Henry for discussions about this project. This work was supported by the National Institute on Deafness and Other Communication Disorders under award number 1R01DC020088-001 (A.G.H.), the Whitehall Foundation (A.G.H.), the Alfred P. Sloan Foundation (A.G.H.) and the Burroughs Wellcome Fund (A.G.H.).

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Resource Availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Alexander G. Huth (huth@cs.utexas.edu).

Materials availability

This study did not create any new materials.

Data and code availability

Declaration of interests

A.G.H. and J.T. are inventors on a pending patent application (the applicant is The University of Texas System Board of Regents) that is directly relevant to the language decoding approach used in this work.

References

  • 1. Geschwind N. (1970). The organization of language and the brain. Science 170, 940–944.
  • 2. Silva AB, Littlejohn KT, Liu JR, Moses DA, and Chang EF (2024). The speech neuroprosthesis. Nat. Rev. Neurosci, 1–20.
  • 3. Moses DA, Metzger SL, Liu JR, Anumanchipalli GK, Makin JG, Sun PF, Chartier J, Dougherty ME, Liu PM, Abrams GM, et al. (2021). Neuroprosthesis for Decoding Speech in a Paralyzed Person with Anarthria. N. Engl. J. Med 385, 217–227.
  • 4. Willett FR, Kunz EM, Fan C, Avansino DT, Wilson GH, Choi EY, Kamdar F, Glasser MF, Hochberg LR, Druckmann S, et al. (2023). A high-performance speech neuroprosthesis. Nature 620, 1031–1036.
  • 5. Metzger SL, Littlejohn KT, Silva AB, Moses DA, Seaton MP, Wang R, Dougherty ME, Liu JR, Wu P, Berger MA, et al. (2023). A high-performance neuroprosthesis for speech decoding and avatar control. Nature 620, 1037–1046.
  • 6. DeLeon J, Gottesman RF, Kleinman JT, Newhart M, Davis C, Heidler-Gary J, Lee A, and Hillis AE (2007). Neural regions essential for distinct cognitive processes underlying picture naming. Brain 130, 1408–1422.
  • 7. Hickok G, and Poeppel D (2007). The cortical organization of speech processing. Nat. Rev. Neurosci 8, 393–402.
  • 8. Basilakos A, Rorden C, Bonilha L, Moser D, and Fridriksson J (2015). Patterns of poststroke brain damage that predict speech production errors in apraxia of speech and aphasia dissociate. Stroke 46, 1561–1566.
  • 9. Fridriksson J, den Ouden D-B, Hillis AE, Hickok G, Rorden C, Basilakos A, Yourganov G, and Bonilha L (2018). Anatomy of aphasia revisited. Brain 141, 848–862.
  • 10. Pereira F, Lou B, Pritchett B, Ritter S, Gershman SJ, Kanwisher N, Botvinick M, and Fedorenko E (2018). Toward a universal decoder of linguistic meaning from brain activation. Nat. Commun 9, 963.
  • 11. Tang J, LeBel A, Jain S, and Huth AG (2023). Semantic reconstruction of continuous language from non-invasive brain recordings. Nat. Neurosci 26, 858–866.
  • 12. Wilson SM, Entrup JL, Schneck SM, Onuscheck CF, Levy DF, Rahman M, Willey E, Casilio M, Yen M, Brito AC, et al. (2023). Recovery from aphasia in the first year after stroke. Brain 146, 1021–1039.
  • 13. Dale AM, Fischl B, and Sereno MI (1999). Cortical surface-based analysis. I. Segmentation and surface reconstruction. Neuroimage 9, 179–194.
  • 14. Jenkinson M, and Smith S (2001). A global optimisation method for robust affine registration of brain images. Med. Image Anal 5, 143–156.
  • 15. Hasson U, Nir Y, Levy I, Fuhrmann G, and Malach R (2004). Intersubject synchronization of cortical activity during natural vision. Science 303, 1634–1640.
  • 16. Haxby JV, Guntupalli JS, Connolly AC, Halchenko YO, Conroy BR, Gobbini MI, Hanke M, and Ramadge PJ (2011). A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron 72, 404–416.
  • 17. Ho JK, Horikawa T, Majima K, Cheng F, and Kamitani Y (2023). Inter-individual deep image reconstruction via hierarchical neural code conversion. Neuroimage 271, 120007.
  • 18. Thual A, Benchetrit Y, Geilert F, Rapin J, Makarov I, Banville H, and King J-R (2023). Aligning brain functions boosts the decoding of visual semantics in novel subjects. arXiv [cs.LG].
  • 19. Scotti PS, Tripathy M, Villanueva CKT, Kneeland R, Chen T, Narang A, Santhirasegaran C, Xu J, Naselaris T, Norman KA, et al. (2024). MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data. arXiv [cs.CV].
  • 20. Ferrante M, Boccato T, and Toschi N (2024). Through their eyes: multi-subject Brain Decoding with simple alignment techniques. Imaging Neuroscience.
  • 21. Défossez A, Caucheteux C, Rapin J, Kabeli O, and King J-R (2023). Decoding speech perception from non-invasive brain recordings. Nat. Mach. Intell 5, 1097–1107.
  • 22. Taschereau-Dumouchel V, Cortese A, Chiba T, Knotts JD, Kawato M, and Lau H (2018). Towards an unconscious neural reinforcement intervention for common fears. Proc. Natl. Acad. Sci. U. S. A 115, 3470–3475.
  • 23. Yamada K, Miyawaki Y, and Kamitani Y (2015). Inter-subject neural code converter for visual image representation. Neuroimage 113, 289–297.
  • 24. Fairhall SL, and Caramazza A (2013). Brain regions that represent amodal conceptual knowledge. J. Neurosci 33, 10552–10558.
  • 25. Devereux BJ, Clarke A, Marouchos A, and Tyler LK (2013). Representational similarity analysis reveals commonalities and differences in the semantic processing of words and objects. J. Neurosci 33, 18906–18916.
  • 26. Martin A. (2016). GRAPES—Grounding representations in action, perception, and emotion systems: How object properties and categories are represented in the human brain. Psychon. Bull. Rev 23, 979–990.
  • 27. Ivanova AA, Mineroff Z, Zimmerer V, Kanwisher N, Varley R, and Fedorenko E (2021). The Language Network Is Recruited but Not Required for Nonverbal Event Semantics. Neurobiology of Language 2, 176–201.
  • 28. Popham SF, Huth AG, Bilenko NY, Deniz F, Gao JS, Nunez-Elizalde AO, and Gallant JL (2021). Visual and linguistic semantic representations are aligned at the border of human visual cortex. Nat. Neurosci 24, 1628–1636.
  • 29. Tang J, Du M, Vo VA, Lal V, and Huth AG (2023). Brain encoding models based on multimodal transformers can transfer across language and vision. In Advances in Neural Information Processing Systems 37, pp. 29654–29666.
  • 30. Mell MM, St-Yves G, and Naselaris T (2021). Voxel-to-voxel predictive models reveal unexpected structure in unexplained variance. Neuroimage 238, 118266.
  • 31. Hamilton LS, and Huth AG (2018). The revolution will not be controlled: natural stimuli in speech neuroscience. Language, Cognition and Neuroscience, 1–10.
  • 32. Haxby JV, Guntupalli JS, Nastase SA, and Feilong M (2020). Hyperalignment: Modeling shared information encoded in idiosyncratic cortical topographies. Elife 9, e56601.
  • 33. Nastase SA, Goldstein A, and Hasson U (2020). Keep it real: rethinking the primacy of experimental control in cognitive neuroscience. Neuroimage 222, 117254.
  • 34. LeBel A, Wagner L, Jain S, Adhikari-Desai A, Gupta B, Morgenthal A, Tang J, Xu L, and Huth AG (2023). A natural language fMRI dataset for voxelwise encoding models. Sci Data 10, 555.
  • 35. Zhang T, Kishore V, Wu F, Weinberger KQ, and Artzi Y (2020). BERTScore: Evaluating Text Generation with BERT. In 8th International Conference on Learning Representations.
  • 36. Schoppe O, Harper NS, Willmore BDB, King AJ, and Schnupp JWH (2016). Measuring the Performance of Neural Models. Front. Comput. Neurosci 10, 10.
  • 37. de Heer WA, Huth AG, Griffiths TL, Gallant JL, and Theunissen FE (2017). The hierarchical cortical organization of human speech processing. J. Neurosci 37, 6539–6557.
  • 38. LeBel A, Jain S, and Huth AG (2021). Voxelwise Encoding Models Show That Cerebellar Language Representations Are Highly Conceptual. J. Neurosci 41, 10341–10355.
  • 39. Carter AR, Shulman GL, and Corbetta M (2012). Why use a connectivity-based approach to study stroke and recovery of function? Neuroimage 62, 2271–2280.
  • 40. Fridriksson J, Yourganov G, Bonilha L, Basilakos A, Den Ouden D-B, and Rorden C (2016). Revealing the dual streams of speech processing. Proc. Natl. Acad. Sci. U. S. A 113, 15108–15113.
  • 41. Huth AG, de Heer WA, Griffiths TL, Theunissen FE, and Gallant JL (2016). Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532, 453–458.
  • 42. Grober E, Perecman E, Kellar L, and Brown J (1980). Lexical knowledge in anterior and posterior aphasics. Brain Lang. 10, 318–330.
  • 43. Jefferies E, and Lambon Ralph MA (2006). Semantic impairment in stroke aphasia versus semantic dementia: a case-series comparison. Brain 129, 2132–2147.
  • 44. Meier EL, Lo M, and Kiran S (2016). Understanding semantic and phonological processing deficits in adults with aphasia: Effects of category and typicality. Aphasiology 30, 719–749.
  • 45. Fama ME, Hayward W, Snider SF, Friedman RB, and Turkeltaub PE (2017). Subjective experience of inner speech in aphasia: Preliminary behavioral relationships and neural correlates. Brain Lang. 164, 32–42.
  • 46. Goering S, Klein E, Specker Sullivan L, Wexler A, Agüera Y Arcas B, Bi G, Carmena JM, Fins JJ, Friesen P, Gallant J, et al. (2021). Recommendations for Responsible Development and Application of Neurotechnologies. Neuroethics 14, 365–386.
  • 47. Huang S, Paul U, Gupta S, Desai K, Guo M, Jung J, Capestany B, Krenzer WD, Stonecipher D, and Farahany N (2024). U.S. public perceptions of the sensitivity of brain data. J Law Biosci 11, lsad032.
  • 48. Ligthart S, Ienca M, Meynen G, Molnar-Gabor F, Andorno R, Bublitz C, Catley P, Claydon L, Douglas T, Farahany N, et al. (2023). Minding Rights: Mapping Ethical and Legal Foundations of “Neurorights”. Camb. Q. Healthc. Ethics, 1–21.
  • 49. Nishimoto S, Vu AT, Naselaris T, Benjamini Y, Yu B, and Gallant JL (2011). Reconstructing visual experiences from brain activity evoked by natural movies. Curr. Biol 21, 1641–1646.
  • 50. Yuan J, and Liberman M (2008). Speaker identification on the SCOTUS corpus. J. Acoust. Soc. Am 123, 3878.
  • 51. Boersma P, and Weenink D (2014). Praat: doing phonetics by computer.
  • 52. Kay KN, David SV, Prenger RJ, Hansen KA, and Gallant JL (2008). Modeling low-frequency fluctuation and hemodynamic response timecourse in event-related fMRI. Hum. Brain Mapp 29, 142–156.
  • 53. Gao JS, Huth AG, Lescroart MD, and Gallant JL (2015). Pycortex: an interactive surface visualizer for fMRI. Front. Neuroinform 9, 23.
  • 54. Radford A, Narasimhan K, Salimans T, and Sutskever I (2018). Improving language understanding by generative pre-training. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
  • 55. Jain S, and Huth AG (2018). Incorporating Context into Language Encoding Models for fMRI. In Advances in Neural Information Processing Systems 31, pp. 6629–6638.
  • 56. Toneva M, and Wehbe L (2019). Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain). In Advances in Neural Information Processing Systems 32, pp. 14928–14938.
  • 57. Caucheteux C, and King J-R (2022). Brains and algorithms partially converge in natural language processing. Commun. Biol 5, 134.
  • 58. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, et al. (2020). Array programming with NumPy. Nature 585, 357–362.
  • 59. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, et al. (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272.
  • 60. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
  • 61. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, et al. (2020). Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45.
  • 62. Holtzman A, Buys J, Du L, Forbes M, and Choi Y (2020). The Curious Case of Neural Text Degeneration. In 8th International Conference on Learning Representations.
  • 63. Fischl B, Sereno MI, Tootell RBH, and Dale AM (1999). High-resolution intersubject averaging and a coordinate system for the cortical surface. Hum. Brain Mapp 8, 272–284.
  • 64. Holtzman A, Buys J, Du L, Forbes M, and Choi Y (2020). The Curious Case of Neural Text Degeneration. In International Conference on Learning Representations.
  • 65. Benjamini Y, and Hochberg Y (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Series B Stat. Methodol 57, 289–300.
