| 1 | Character n-gram similarity | Real | Cosine similarity of letter n-grams of terms s and t (n=1, 2, 3). | (0.954, 0.953, 0.951) | (1.037, 3.838, 9.043) |
| 2 | Normalized Levenshtein distance | Real | The minimum number of insertions, deletions and substitution operations necessary to transform one term into the other (Levenshtein, 1966), divided by the number of characters in the longer term. | 0.061 | 2.742 |
| 3 | Jaro–Winkler similarity (Winkler, 1999) | Real | This metric considers the number of shared letters and transpositions between two terms; the metric also incorporates a formula to favor two terms that match from the beginning. | 0.979 | −0.536 |
| 4 | Word n-gram similarity | Real | Cosine similarity of word n-grams of terms s and t (n=1, 2, 3). | (0.750, 0.667, 0.500) | (0.457, −2.439, 0.523) |
| 5 | SoftTFIDF (Cohen et al., 2003) | Real | This metric aligns tokens between two strings using the Jaro–Winkler similarity with threshold 0.9, and computes the sum of the similarity scores of aligned pairs; the similarity score is based on TFIDF scores. | 1.883 | 0.946 |
| 6 | Bias | Real | This feature always yields 1. | 1 | −9.340 |
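As an illustration of how features 1, 2 and 4 could be computed, the following is a minimal sketch rather than the implementation used to produce the values in the table. It assumes n-grams are counted as bags (with multiplicity), words are split on whitespace, and no case folding or padding is applied; the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngrams(items, n):
    """Return the contiguous n-grams of a sequence as a list of tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def cosine(a, b):
    """Cosine similarity between two bags of n-grams (Counter objects)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def char_ngram_similarity(s, t, n):
    """Feature 1: cosine similarity of letter n-grams of s and t."""
    return cosine(Counter(ngrams(s, n)), Counter(ngrams(t, n)))


def word_ngram_similarity(s, t, n):
    """Feature 4: cosine similarity of word n-grams (whitespace tokens, an assumption)."""
    return cosine(Counter(ngrams(s.split(), n)), Counter(ngrams(t.split(), n)))


def normalized_levenshtein(s, t):
    """Feature 2: edit distance divided by the length of the longer term."""
    if not s and not t:
        return 0.0
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution or match
        prev = curr
    return prev[-1] / max(len(s), len(t))
```

For a pair of terms `s` and `t`, calling `char_ngram_similarity(s, t, n)` and `word_ngram_similarity(s, t, n)` for n = 1, 2, 3 gives the two triples of feature values, and `normalized_levenshtein(s, t)` gives a value between 0 (identical terms) and 1.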