Algorithm 1.

Computes a probability score for a text document D given a masked language model M. A call to Forward returns a matrix with one row per token position, where each row is a probability distribution over all tokens in the vocabulary. The Append function adds a value to the end of a list.

procedure Masked-Prob(D, M)
    sents ← Sentence-Split(D)
    P ← Initialize empty list
    for i = 1 … |sents| do
        T ← Tokenize(sents[i])
        for j = 1 … 10 do
            A ← sample 15% from 1 … |T|
            T′ ← T
            for all a ∈ A do
                T′[a] ← [MASK]
            outputs ← Forward(M, T′)
            for all a ∈ A do
                prob ← outputs[a][T[a]]
                Append(P, prob)
    return mean(P)
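
The procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the model is abstracted as a `forward` callable that takes a list of token ids and returns one distribution per position, `tokenize` is any callable that maps a sentence to token ids, and `sentence_split` is a naive regex splitter. The names `MASK_ID`, `rounds`, and `mask_frac` are hypothetical. The listing does not say how 15% is rounded or what happens when P is empty, so the sketch masks at least one position per round and returns NaN for an empty document. Indices are 0-based here, whereas the listing is 1-based.

```python
import math
import random
import re
from typing import Callable, List, Optional, Sequence

MASK_ID = -1  # hypothetical id standing in for the [MASK] token


def sentence_split(doc: str) -> List[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", doc.strip())
    return [p for p in parts if p]


def masked_prob(
    doc: str,
    forward: Callable[[List[int]], Sequence[Sequence[float]]],
    tokenize: Callable[[str], List[int]],
    mask_id: int = MASK_ID,
    rounds: int = 10,
    mask_frac: float = 0.15,
    rng: Optional[random.Random] = None,
) -> float:
    rng = rng or random.Random()
    probs: List[float] = []  # P in the listing
    for sent in sentence_split(doc):
        tokens = tokenize(sent)  # T
        if not tokens:
            continue
        # Assumption: round 15% of |T| and mask at least one position.
        k = max(1, round(mask_frac * len(tokens)))
        for _ in range(rounds):
            positions = rng.sample(range(len(tokens)), k)  # A
            masked = list(tokens)  # T', a copy so T keeps the true ids
            for a in positions:
                masked[a] = mask_id
            outputs = forward(masked)
            for a in positions:
                # Probability the model assigns to the original token at a.
                probs.append(outputs[a][tokens[a]])
    # Assumption: an empty P yields NaN rather than an error.
    return sum(probs) / len(probs) if probs else math.nan
```

With a toy `forward` that returns a uniform distribution over a 5-token vocabulary, every collected probability is 0.2, so the score is 0.2 up to floating-point rounding. Passing a seeded `random.Random` as `rng` makes the sampled positions reproducible.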