2013 May 28;21(1):105–110. doi: 10.1136/amiajnl-2012-001552

Table 1.

Performance on duplicate citation detection using various lengths of n-grams and their combination

            Unigrams             Bigrams              Trigrams
            P      R      F1     P      R      F1     P      R      F1
NG          0.85   0.84   0.84   0.83   0.81   0.81   0.84   0.82   0.82
LM-NG       0.84   0.83   0.83   0.84   0.83   0.82   0.85   0.84   0.84
Del         -      -      -      0.83   0.81   0.81   0.83   0.81   0.81
LM-Del      -      -      -      0.84   0.82   0.82   0.84   0.83   0.83
Sub         0.96*  0.96*  0.96*  0.85   0.84   0.83   0.86   0.84   0.84
LM-Sub      0.96*  0.96*  0.96*  0.86*  0.86*  0.86*  0.86   0.84   0.84
Del+Sub     -      -      -      0.86*  0.85*  0.85*  0.87*  0.85*  0.85*
LM-Del+Sub  -      -      -      0.87*  0.87*  0.87*  0.86*  0.85*  0.85*

            Fourgrams            Fivegrams            Combined
            P      R      F1     P      R      F1     P      R      F1
NG          0.83   0.82   0.82   0.84   0.83   0.83   0.92   0.91   0.91
LM-NG       0.84   0.83   0.82   0.84   0.84   0.84   0.93   0.93   0.93
Del         0.84   0.83   0.83   0.85   0.84   0.84   0.88   0.87   0.87
LM-Del      0.85*  0.84*  0.84*  0.85*  0.85*  0.85*  0.94   0.94   0.94
Sub         0.84   0.83   0.83   0.85   0.85   0.85   0.99*  0.99*  0.99*
LM-Sub      0.86*  0.86*  0.86*  0.83   0.83   0.83   0.99*  0.99*  0.99*
Del+Sub     0.86*  0.85*  0.85*  0.86*  0.86*  0.86*  0.99*  0.99*  0.99*
LM-Del+Sub  0.88*  0.87*  0.87*  0.85   0.84   0.84   0.99*  0.99*  0.99*
eTBlast 0.87 0.84 0.84

*Statistically significant improvement over the baseline approach (ie, NG) (Wilcoxon signed-rank test, p<0.05).

NG is the basic containment measure which does not make use of modified n-grams. Del, Sub, and Del+Sub indicate that modified n-grams are included, generated using deletion, substitution, or by combining both approaches. LM indicates that the n-grams are weighted using the language model.
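
The row labels above can be illustrated with a minimal Python sketch of a containment score over word n-grams, with optional deletion and substitution variants. This is an illustration under stated assumptions, not the authors' implementation: the function names, the toy synonym table, and the rule that a deletion variant is matched against the candidate's (n-1)-grams are all hypothetical. The language-model weighting (LM) is not shown because the table notes do not specify it.

```python
from typing import Dict, List, Optional, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deletion_variants(gram: NGram) -> Set[NGram]:
    """Variants made by deleting exactly one word (undefined for unigrams)."""
    if len(gram) < 2:
        return set()
    return {gram[:i] + gram[i + 1:] for i in range(len(gram))}


def substitution_variants(gram: NGram, synonyms: Dict[str, List[str]]) -> Set[NGram]:
    """Variants made by replacing one word with a listed synonym."""
    out: Set[NGram] = set()
    for i, word in enumerate(gram):
        for alt in synonyms.get(word, []):
            out.add(gram[:i] + (alt,) + gram[i + 1:])
    return out


def containment(query: List[str], candidate: List[str], n: int,
                use_del: bool = False,
                synonyms: Optional[Dict[str, List[str]]] = None) -> float:
    """Fraction of query n-grams found in the candidate, directly or via a variant."""
    query_grams = ngrams(query, n)
    if not query_grams:
        return 0.0
    target = ngrams(candidate, n)
    if use_del and n > 1:
        # Assumption: shortened variants are compared with candidate (n-1)-grams.
        target |= ngrams(candidate, n - 1)
    matched = 0
    for gram in query_grams:
        variants = {gram}
        if use_del:
            variants |= deletion_variants(gram)
        if synonyms:
            variants |= substitution_variants(gram, synonyms)
        if variants & target:
            matched += 1
    return matched / len(query_grams)
```

In this sketch, a call with neither option corresponds to the NG setting, `use_del=True` to Del, a synonym table to Sub, and both together to Del+Sub. Because deleting the only word of a unigram leaves nothing to match, the deletion variant has no effect when `n` is 1.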