Proc Natl Acad Sci U S A. 2020 Jun 3;117(48):30046–30054. doi: 10.1073/pnas.1907367117

Fig. 3.

A high-level illustration of BERT. Words in the input sequence are randomly masked out and then all words are embedded as vectors in ℝ^d. A Transformer network applies multiple layers of multi-headed attention to the representations. The final representations are used to predict the identities of the masked-out input words.
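As a rough sketch of the masked-word prediction setup the caption describes, the Python code below embeds token ids as vectors in ℝ^d, passes them through several layers of multi-headed attention, and predicts the identities of the randomly masked-out words. It uses PyTorch's generic TransformerEncoder rather than the actual BERT implementation; the vocabulary size, model dimensions, 15% masking rate, and mask token id are illustrative assumptions, and positional embeddings are omitted for brevity.

import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL, N_HEADS, N_LAYERS = 30_000, 256, 4, 4   # assumed values
MASK_ID, MASK_PROB = 0, 0.15                                  # assumed values

class MaskedLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Each word is embedded as a vector in R^d (here d = D_MODEL).
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True)
        # Multiple layers of multi-headed attention over the representations.
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        # Final representations are mapped back to the vocabulary to predict
        # the identities of the masked-out input words.
        self.predict = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, token_ids):
        return self.predict(self.encoder(self.embed(token_ids)))

# Randomly mask out words in the input sequence and train to recover them.
tokens = torch.randint(1, VOCAB_SIZE, (8, 32))        # a batch of token ids
mask = torch.rand(tokens.shape) < MASK_PROB           # positions to hide
masked = tokens.masked_fill(mask, MASK_ID)

model = MaskedLM()
logits = model(masked)                                # (8, 32, VOCAB_SIZE)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()

The loss is computed only at the masked positions, which is the self-supervised objective the figure illustrates: the network must infer each hidden word from the surrounding context.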