2010 Mar 30;26(9):1246–1253. doi: 10.1093/bioinformatics/btq129

Table 1.

Features for the string similarity measure

| # | Feature | Type | Description | Example | Weight (w) |
|---|---------|------|-------------|---------|------------|
| 1 | Character n-gram similarity | Real | Cosine similarity of letter n-grams of terms s and t (n = 1, 2, 3). | (0.954, 0.953, 0.951) | (1.037, 3.838, 9.043) |
| 2 | Normalized Levenshtein distance | Real | The minimum number of insertion, deletion, and substitution operations needed to transform one term into the other (Levenshtein, 1966), divided by the number of characters in the longer term. | 0.061 | 2.742 |
| 3 | Jaro–Winkler similarity (Winkler, 1999) | Real | This metric considers the number of shared letters and transpositions between two terms; it also incorporates a formula that favors terms matching from the beginning. | 0.979 | −0.536 |
| 4 | Word n-gram similarity | Real | Cosine similarity of word n-grams of terms s and t (n = 1, 2, 3). | (0.750, 0.667, 0.500) | (0.457, −2.439, 0.523) |
| 5 | SoftTFIDF (Cohen et al., 2003) | Real | This metric aligns tokens between two strings using the Jaro–Winkler similarity with threshold 0.9 and sums the similarity scores of the aligned pairs; the similarity score is based on TFIDF scores. | 1.883 | 0.946 |
| 6 | Bias | Real | This feature always yields 1. | 1 | −9.340 |
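
As an illustration of features 1, 2 and 4, the following is a minimal Python sketch, not the authors' implementation. The function names (`ngrams`, `cosine`, `levenshtein`, `normalized_levenshtein`, `linear_score`) are hypothetical, and n-gram counts are used as raw frequencies without padding, which the table does not specify. The last function assumes the weights are combined as a plain weighted sum of the feature values; the table lists the weights but does not state how they are applied.

```python
import math
from collections import Counter


def ngrams(items, n):
    """Count n-grams of a string (letters) or of a list of words."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # one term is too short to have any n-gram
    return dot / (norm_a * norm_b)


def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def normalized_levenshtein(s, t):
    """Edit distance divided by the length of the longer term."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0


def linear_score(features, weights):
    """Assumed combination: weighted sum of feature values."""
    return sum(w * x for w, x in zip(weights, features))


# Example values and weights copied from Table 1, in row order.
example = [0.954, 0.953, 0.951, 0.061, 0.979, 0.750, 0.667, 0.500, 1.883, 1]
weights = [1.037, 3.838, 9.043, 2.742, -0.536, 0.457, -2.439, 0.523, 0.946, -9.340]
print(round(linear_score(example, weights), 3))  # 4.308
```

Character n-grams are obtained with `ngrams(term, n)` and word n-grams with `ngrams(term.split(), n)`, so the same `cosine` function serves both feature 1 and feature 4. Jaro–Winkler and SoftTFIDF are omitted here because they need more machinery than fits a short sketch.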