Table 1.
Approach | Unigrams P | Unigrams R | Unigrams F1 | Bigrams P | Bigrams R | Bigrams F1 | Trigrams P | Trigrams R | Trigrams F1 |
---|---|---|---|---|---|---|---|---|---|
NG | 0.85 | 0.84 | 0.84 | 0.83 | 0.81 | 0.81 | 0.84 | 0.82 | 0.82 |
LM-NG | 0.84 | 0.83 | 0.83 | 0.84 | 0.83 | 0.82 | 0.85 | 0.84 | 0.84 |
Del | – | – | – | 0.83 | 0.81 | 0.81 | 0.83 | 0.81 | 0.81 |
LM-Del | – | – | – | 0.84 | 0.82 | 0.82 | 0.84 | 0.83 | 0.83 |
Sub | 0.96* | 0.96* | 0.96* | 0.85 | 0.84 | 0.83 | 0.86 | 0.84 | 0.84 |
LM-Sub | 0.96* | 0.96* | 0.96* | 0.86* | 0.86* | 0.86* | 0.86 | 0.84 | 0.84 |
Del+Sub | – | – | – | 0.86* | 0.85* | 0.85* | 0.87* | 0.85* | 0.85* |
LM-Del+Sub | – | – | – | 0.87* | 0.87* | 0.87* | 0.86* | 0.85* | 0.85* |

Approach | Fourgrams P | Fourgrams R | Fourgrams F1 | Fivegrams P | Fivegrams R | Fivegrams F1 | Combined P | Combined R | Combined F1 |
---|---|---|---|---|---|---|---|---|---|
NG | 0.83 | 0.82 | 0.82 | 0.84 | 0.83 | 0.83 | 0.92 | 0.91 | 0.91 |
LM-NG | 0.84 | 0.83 | 0.82 | 0.84 | 0.84 | 0.84 | 0.93 | 0.93 | 0.93 |
Del | 0.84 | 0.83 | 0.83 | 0.85 | 0.84 | 0.84 | 0.88 | 0.87 | 0.87 |
LM-Del | 0.85* | 0.84* | 0.84* | 0.85* | 0.85* | 0.85* | 0.94 | 0.94 | 0.94 |
Sub | 0.84 | 0.83 | 0.83 | 0.85 | 0.85 | 0.85 | 0.99* | 0.99* | 0.99* |
LM-Sub | 0.86* | 0.86* | 0.86* | 0.83 | 0.83 | 0.83 | 0.99* | 0.99* | 0.99* |
Del+Sub | 0.86* | 0.85* | 0.85* | 0.86* | 0.86* | 0.86* | 0.99* | 0.99* | 0.99* |
LM-Del+Sub | 0.88* | 0.87* | 0.87* | 0.85 | 0.84 | 0.84 | 0.99* | 0.99* | 0.99* |
eTBlast | 0.87 | 0.84 | 0.84 |
*Statistically significant improvement over the baseline approach (ie, NG) (Wilcoxon signed-rank test, p<0.05).
NG is the basic containment measure, which does not use modified n-grams. Del, Sub, and Del+Sub indicate that modified n-grams are included, generated by deletion, by substitution, or by combining both approaches. LM indicates that the n-grams are weighted using the language model.
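
To make the NG and Del rows concrete, the following is a minimal sketch of a containment score with optional deletion-based n-grams. It assumes the usual set-based definition of containment (shared n-grams divided by the n-grams of the suspicious text) and an illustrative deletion scheme that drops one word at a time from each n-gram. The function names (`ngrams`, `deletion_variants`, `expand`, `containment`) and the example sentences are hypothetical and are not taken from the paper. Substitution and language-model weighting are not shown.

```python
from typing import List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """Return the set of distinct word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deletion_variants(gram: NGram) -> Set[NGram]:
    """Assumed deletion scheme: drop one word at a time from an n-gram.

    Unigrams have no usable variants, so an empty set is returned.
    """
    if len(gram) < 2:
        return set()
    return {gram[:i] + gram[i + 1:] for i in range(len(gram))}


def expand(grams: Set[NGram]) -> Set[NGram]:
    """Add every deletion variant to a set of n-grams."""
    expanded = set(grams)
    for gram in grams:
        expanded |= deletion_variants(gram)
    return expanded


def containment(suspicious: List[str], source: List[str], n: int,
                use_deletion: bool = False) -> float:
    """Fraction of the suspicious text's n-grams that also occur in the source."""
    a = ngrams(suspicious, n)
    b = ngrams(source, n)
    if use_deletion:
        # Illustrative assumption: both sides are expanded before comparison.
        a, b = expand(a), expand(b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


suspicious = "the quick brown fox jumps".split()
source = "the quick red brown fox jumps".split()
plain = containment(suspicious, source, 3)
with_del = containment(suspicious, source, 3, use_deletion=True)
print(f"NG: {plain:.2f}  Del: {with_del:.2f}")
```

In this toy pair, the single inserted word breaks two of the three trigrams, so the plain score is 1/3. Under the assumed deletion scheme, the shorter variants recover several shared word pairs and the score rises to 0.6. Deletion is undefined for unigrams in this sketch.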