Decoder predictions for a perceived story were compared to the actual stimulus words using a range of language similarity metrics. A floor for each metric was computed by averaging the similarity between the actual stimulus words and each of 200 null sequences generated from a language model without using any brain data. A ceiling for each metric was computed by manually translating the actual stimulus words into Mandarin Chinese, automatically translating them back into English with a state-of-the-art machine translation system, and scoring the similarity between the actual stimulus words and the machine translation output. Under the BERTScore metric, the decoder, which was trained on far less paired data and operated on far noisier input, recovered around 20% of the machine translation system's margin over the floor.
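
To make the floor/ceiling normalization concrete, the following is a minimal sketch of the comparison, assuming a generic `similarity` callable standing in for any of the language similarity metrics (e.g., BERTScore); the function and variable names here are hypothetical illustrations, not the original analysis code.

```python
import numpy as np

def relative_performance(similarity, decoded_words, stimulus_words,
                         null_sequences, round_trip_words):
    """Score decoder output against a null floor and a translation ceiling.

    floor:   mean similarity between the actual stimulus words and null
             sequences sampled from a language model without brain data
    ceiling: similarity between the actual stimulus words and their
             round-trip (English -> Mandarin -> English) translation
    """
    floor = np.mean([similarity(null, stimulus_words)
                     for null in null_sequences])
    ceiling = similarity(round_trip_words, stimulus_words)
    decoded = similarity(decoded_words, stimulus_words)

    # Fraction of the ceiling's margin over the floor that the decoder
    # recovers; per the text above, roughly 0.2 under BERTScore.
    return (decoded - floor) / (ceiling - floor)
```

On this normalization, a score of 0 means the decoder is no better than language-model output generated without brain data, and a score of 1 means it matches the round-trip machine translation ceiling.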