2010 Mar 30;26(9):1246–1253. doi: 10.1093/bioinformatics/btq129

Table 1.

Features for the string similarity measure

#	Feature	Type	Description	Example	Weight (w)
1	Character n-gram similarity	Real	Cosine similarity of letter n-grams of terms s and t (n=1, 2, 3).	(0.954, 0.953, 0.951)	(1.037, 3.838, 9.043)
2	Normalized Levenshtein distance	Real	The minimum number of insertion, deletion and substitution operations needed to transform one term into the other (Levenshtein, 1966), divided by the number of characters in the longer term.	0.061	2.742
3	Jaro–Winkler similarity (Winkler, 1999)	Real	This metric considers the number of shared letters and transpositions between two terms; it also raises the score of terms that share a common prefix.	0.979	−0.536
4	Word n-gram similarity	Real	Cosine similarity of word n-grams of terms s and t (n=1, 2, 3).	(0.750, 0.667, 0.500)	(0.457, −2.439, 0.523)
5	SoftTFIDF (Cohen et al., 2003)	Real	This metric aligns tokens between two strings using the Jaro–Winkler similarity with threshold 0.9, and computes the sum of the similarity scores of aligned pairs; the similarity score is based on TFIDF scores.	1.883	0.946
6	Bias	Real	This feature always yields 1.	1	−9.340
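
The sketch below illustrates how two of the features in Table 1 (character n-gram cosine similarity and normalized Levenshtein distance) might be computed, and how the listed weights could be combined with the example feature values. It rests on assumptions the table does not state: n-grams are taken without padding or case folding, and the features are combined as a plain weighted sum, which the weight column and the constant bias feature suggest. Whether that sum is then passed through another function is not specified here. All function and variable names are illustrative.

```python
from collections import Counter
from math import sqrt


def char_ngrams(term, n):
    """Return a multiset of the letter n-grams in `term` (no padding: an assumption)."""
    return Counter(term[i:i + n] for i in range(len(term) - n + 1))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def normalized_levenshtein(s, t):
    """Edit distance divided by the number of characters in the longer term."""
    if not s and not t:
        return 0.0
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution or match
        prev = curr
    return prev[-1] / max(len(s), len(t))


# Example feature values and weights copied from Table 1, flattened in row order:
# char n-grams (n=1,2,3), Levenshtein, Jaro-Winkler, word n-grams (n=1,2,3),
# SoftTFIDF, bias.
EXAMPLE_FEATURES = [0.954, 0.953, 0.951, 0.061, 0.979, 0.750, 0.667, 0.500, 1.883, 1.0]
WEIGHTS = [1.037, 3.838, 9.043, 2.742, -0.536, 0.457, -2.439, 0.523, 0.946, -9.340]


def linear_score(features, weights):
    """Weighted sum of features (assumed combination rule, not confirmed by the table)."""
    return sum(w * f for w, f in zip(weights, features))


if __name__ == "__main__":
    s, t = "kitten", "sitting"  # hypothetical term pair, not from the paper
    print([round(cosine(char_ngrams(s, n), char_ngrams(t, n)), 3) for n in (1, 2, 3)])
    print(round(normalized_levenshtein(s, t), 3))
    print(round(linear_score(EXAMPLE_FEATURES, WEIGHTS), 3))
```

With the example column of Table 1 as input, the weighted sum comes to roughly 4.31 under these assumptions; the large negative bias weight offsets the strongly weighted character trigram feature.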