Author manuscript; available in PMC: 2023 Nov 16.
Published in final edited form as: J Med Syst. 2022 Nov 16;46(12):96. doi: 10.1007/s10916-022-01880-6

Algorithm 1.

DataSifterText - Masking and Prediction

1: Input: W, L, blacklist, whitelist, pw, pn, and replacement method.
2: Construct a BERT model on the documents, with blank-word prediction as the downstream task.
3: for i = 1,…, n do
4:  Set nmask = 0, coef = 1.2.
5:  while nmask ≤ 0.5 * Ti do
6:   for t = 1,…, Ti do
7:    if wt ∈ whitelist then
8:     Set p = 1 − pw * coef.
9:    else if wt ∈ blacklist then
10:     Set p = 0.
11:    else
12:     Set p = 1 − pn * coef.
13:    end if
14:    Mask wt with probability p.
15:    if wt is masked then
16:     nmask = nmask+1
17:     coef = 1.2
18:    else if coef > 0.05 then
19:     coef = coef-0.05
20:    end if
21:   end for
22:  end while
23: end for
24: for i = 1,…, n do
25:  for t = 1,…, Ti do
26:   if wt = [MASK] then
27:    Use a trained BERT model to generate P(wt|Wi).
28:    Sample one token from the obtained distribution P(wt|Wi) and use it to replace the [MASK] token at location t.
29:   end if
30:  end for
31: end for
32: if Replacement method ≠ (No obfuscation,0) then
33:  Construct D(W).
34:  Use Mini Batch K-means to classify documents into K clusters.
35:  Obtain RAKE(W) or TR(W) based on replacement method specification.
36:  for i = 1,...,n do
37:   Sample min(1,000, nWi) documents Wj with index j ≠ i from the same cluster as Wi, where nWi is the number of documents in that cluster.
38:   for j = 1,...,min(1,000, nWi) do
39:    dist(i,j) = (D(W)i · D(W)j) / (‖D(W)i‖ ‖D(W)j‖)
40:   end for
41:   Sample one document from {Wj : dist(i,j) is among the smallest 10% over all sampled j} as the replacement partner for document i, and denote it Wj*.
42:   if Replacement method = (RAKE keyphrase,q) then
43:    Replace RAKE(Wi)q with RAKE(Wj*)q
44:   else if Replacement method = (RAKE index,1) then
45:    Replace all tokens from RAKE(Wi)1 to WTi with RAKE(Wj*)1 to WTj*
46:   else
47:    Replace TR(Wi)q with TR(Wj*)q.
48:   end if
49:  end for
50: end if
51: Output: W* = W, and L* = L
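
The masking loop in steps 4-22 can be sketched for a single document as follows. This is a minimal illustration rather than the authors' implementation: the function name `mask_document` and the `rng` and `max_passes` parameters are hypothetical, and three behaviors are assumptions that the algorithm does not state: tokens that are already masked are skipped on later passes, p is clamped to [0, 1], and the outer loop is capped so that it terminates even when too few tokens are eligible for masking.

```python
import random

MASK = "[MASK]"

def mask_document(tokens, whitelist, blacklist, pw, pn, rng=None, max_passes=1000):
    """Sketch of steps 4-22 of Algorithm 1 for one document (a list of tokens)."""
    rng = rng or random.Random()
    tokens = list(tokens)          # work on a copy
    n_mask, coef = 0, 1.2          # step 4
    passes = 0
    # Step 5: repeat until more than half of the tokens are masked.
    # Assumption: max_passes guards against a document with too few maskable tokens.
    while n_mask <= 0.5 * len(tokens) and passes < max_passes:
        passes += 1
        for t, w in enumerate(tokens):            # step 6
            if w == MASK:
                continue                          # assumption: skip already-masked tokens
            if w in whitelist:                    # steps 7-8
                p = 1 - pw * coef
            elif w in blacklist:                  # steps 9-10
                p = 0.0
            else:                                 # steps 11-12
                p = 1 - pn * coef
            p = min(1.0, max(0.0, p))             # assumption: clamp to a valid probability
            if rng.random() < p:                  # step 14
                tokens[t] = MASK
                n_mask += 1                       # step 16
                coef = 1.2                        # step 17
            elif coef > 0.05:                     # steps 18-19
                coef -= 0.05
    return tokens, n_mask
```

Because coef resets to 1.2 after every mask and decays by 0.05 after every miss, the masking probability is lowest immediately after a mask and rises with each consecutive unmasked token.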
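
The partner selection in steps 36-41 can be sketched in the same spirit. The names `normalized_dot`, `pick_partner`, `doc_vectors`, `clusters`, and `top_frac` are hypothetical. The sketch assumes that dist(i,j) in step 39 is the inner product of the document vectors D(W)i and D(W)j divided by the product of their Euclidean norms, and it reads "smallest" in step 41 literally; if dist is intended as a similarity rather than a distance, the sort direction would need to be reversed. Returning `None` for a document that is alone in its cluster is also an assumption.

```python
import math
import random

def normalized_dot(u, v):
    """Inner product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def pick_partner(doc_vectors, clusters, i, rng=None, max_candidates=1000, top_frac=0.10):
    """Sketch of steps 37-41: choose a replacement partner for document i."""
    rng = rng or random.Random()
    # Step 37: candidates are the other documents in the same cluster as document i.
    pool = [j for j, c in enumerate(clusters) if c == clusters[i] and j != i]
    if not pool:
        return None  # assumption: no partner exists for a singleton cluster
    candidates = rng.sample(pool, min(max_candidates, len(pool)))
    # Steps 38-40: score every sampled candidate against document i, ascending.
    ranked = sorted(candidates, key=lambda j: normalized_dot(doc_vectors[i], doc_vectors[j]))
    # Step 41: keep the candidates with the smallest 10% of scores (at least one).
    keep = max(1, math.ceil(top_frac * len(ranked)))
    return rng.choice(ranked[:keep])
```

For example, with `doc_vectors = [[1, 0], [1, 0.1], [0, 1], [5, 5]]` and `clusters = [0, 0, 0, 1]`, document 0 has two candidates, so only the single lowest-scoring one (document 2) is retained.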