Algorithm 1.
DataSifterText - Masking and Prediction
| 1: | Input: W, L, blacklist, whitelist, pw, pn, and replacement method. |
| 2: | Train BERT on the documents, with blank-word (masked-token) prediction as the downstream task. |
| 3: | for i = 1,…, n do |
| 4: | Set nmask = 0, coef = 1.2. |
| 5: | while nmask ≤ 0.5 * Ti do |
| 6: | for t = 1,…, Ti do |
| 7: | if wt ∈ whitelist then |
| 8: | Set p = 1 − pw * coef. |
| 9: | else if wt ∈ blacklist then |
| 10: | Set p = 0. |
| 11: | else |
| 12: | Set p = 1 − pn * coef. |
| 13: | end if |
| 14: | Mask wt with probability p. |
| 15: | if wt is masked then |
| 16: | nmask = nmask+1 |
| 17: | coef = 1.2 |
| 18: | else if coef > 0.05 then |
| 19: | coef = coef-0.05 |
| 20: | end if |
| 21: | end for |
| 22: | end while |
| 23: | end for |
| 24: | for i = 1,…, n do |
| 25: | for t = 1,…, Ti do |
| 26: | if wt = [MASK] then |
| 27: | Use a trained BERT model to generate P(wt|Wi). |
| 28: | Sample one token from the obtained distribution P(wt|Wi) and use it to replace the [MASK] token at location t. |
| 29: | end if |
| 30: | end for |
| 31: | end for |
| 32: | if Replacement method ≠ (No obfuscation,0) then |
| 33: | Construct D(W). |
| 34: | Use Mini Batch K-means to partition the documents into K clusters. |
| 35: | Obtain RAKE(W) or TR(W) based on replacement method specification. |
| 36: | for i = 1,...,n do |
| 37: | Sample min(1,000, nWi) documents in the same cluster as Wi, where nWi is the number of documents in the same cluster as Wi, such that the document index j ≠ i. |
| 38: | for j = 1,..., min(1,000, nWi) do |
| 39: | |
| 40: | end for |
| 41: | Sample one document from {Wj : dist(i,j) is among the smallest 10% over all sampled j} as the replacement partner for document i, and denote it as Wj*. |
| 42: | if Replacement method = (RAKE keyphrase,q) then |
| 43: | Replace RAKE(Wi)q with RAKE(Wj*)q |
| 44: | else if Replacement method = (RAKE index,1) then |
| 45: | Replace all tokens from RAKE(Wi)1 to with RAKE(Wj*)1 to |
| 46: | else |
| 47: | Replace TR(Wi)q with TR(Wj*)q. |
| 48: | end if |
| 49: | end for |
| 50: | end if |
| 51: | Output: W* = W, and L* = L |
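
The masking pass in steps 3 to 23 can be illustrated with a minimal Python sketch for a single document. It follows the listing as written: whitelist tokens use p = 1 − pw * coef, blacklist tokens use p = 0, and all other tokens use p = 1 − pn * coef. Three details are assumptions and are not part of the listing: p is clamped to [0, 1], tokens masked in an earlier pass are skipped, and a hypothetical `max_passes` guard stops the loop when too few tokens are eligible for masking. The function name `mask_document` is also hypothetical.

```python
import random

MASK = "[MASK]"


def mask_document(tokens, whitelist, blacklist, pw, pn, rng, max_passes=100):
    """Return a masked copy of `tokens`, following steps 4-22 of the listing."""
    out = list(tokens)
    n_mask, coef = 0, 1.2
    passes = 0
    # Repeat full passes until more than half of the tokens are masked.
    # `max_passes` is an assumed safety guard, not part of the listing.
    while n_mask <= 0.5 * len(out) and passes < max_passes:
        passes += 1
        for t, w in enumerate(out):
            if w == MASK:
                continue  # assumption: skip tokens masked in an earlier pass
            if w in whitelist:
                p = 1 - pw * coef
            elif w in blacklist:
                p = 0.0
            else:
                p = 1 - pn * coef
            p = min(1.0, max(0.0, p))  # assumption: clamp to a valid probability
            if rng.random() < p:
                out[t] = MASK
                n_mask += 1
                coef = 1.2  # reset after a successful mask
            elif coef > 0.05:
                coef -= 0.05  # raise p for the next token after a miss
    return out


if __name__ == "__main__":
    doc = "the patient was given aspirin for chest pain at the clinic".split()
    masked = mask_document(doc, {"aspirin"}, {"patient"}, pw=0.9, pn=0.5,
                           rng=random.Random(0))
    print(" ".join(masked))
```

Because coef resets to 1.2 after each mask and shrinks by 0.05 after each miss, the masking probability rises during a run of unmasked tokens and falls back right after a mask, which tends to spread the masked positions across the document.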