Decoder predictions for a perceived story were compared to the actual stimulus words using a range of language similarity metrics. A floor for each metric was computed by averaging the similarity between the actual stimulus words and each of 200 null sequences generated from a language model without using any brain data. A ceiling for each metric was computed by manually translating the actual stimulus words into Mandarin Chinese, automatically translating them back into English with a state-of-the-art machine translation system, and scoring the similarity between the actual stimulus words and the machine translation output. Under the BERTScore metric, the decoder, which was trained on far less paired data and operated on far noisier input, recovered around 20% of the machine translation system's margin over the floor.
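
To make the floor/ceiling normalization concrete, the following is a minimal sketch of the comparison, assuming a generic `similarity` callable standing in for any of the language similarity metrics (e.g., BERTScore); the function and variable names here are hypothetical illustrations, not the original analysis code.

```python
import numpy as np

def relative_performance(similarity, decoded_words, stimulus_words,
                         null_sequences, round_trip_words):
    """Score decoder output against a null floor and a translation ceiling.

    floor:   mean similarity between the actual stimulus words and null
             sequences sampled from a language model without brain data
    ceiling: similarity between the actual stimulus words and their
             round-trip (English -> Mandarin -> English) translation
    """
    floor = np.mean([similarity(null, stimulus_words)
                     for null in null_sequences])
    ceiling = similarity(round_trip_words, stimulus_words)
    decoded = similarity(decoded_words, stimulus_words)

    # Fraction of the ceiling's margin over the floor that the decoder
    # recovers; per the text above, roughly 0.2 under BERTScore.
    return (decoded - floor) / (ceiling - floor)
```

On this normalization, a score of 0 means the decoder is no better than language-model output generated without brain data, and a score of 1 means it matches the round-trip machine translation ceiling.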