Table 1.

Features employed in the pairwise similarity model.

Rank in related articles. This employs the “Related citations” function within PubMed, in which every article has been scored for similarity to all other articles using a formula that takes into account weighted word similarity in title and abstract as well as Medical Subject Headings [13]. For each pair of articles p1 and p2 under consideration, we retrieve the rank of p1 relative to p2 and of p2 relative to p1. Our rank similarity score = min(rank(p1, p2), rank(p2, p1)), where rank(p1, p2) is the rank of p2 in the related-article list of p1. For example, if p1 is in 3rd place in the related-article list of p2, and p2 is in 5th place in the related-article list of p1, then the rank similarity score is 3. If rank(p1, p2) > 20, i.e. p2 is not in the top-20 list of p1, we assign it a large value of 500.
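The rank similarity score can be illustrated with a minimal Python sketch. The `related` dictionary, the article IDs, and the function names are hypothetical stand-ins for the PubMed related-article lists; only the top-20 cutoff and the value 500 come from the description above.

```python
NOT_RANKED = 500  # value assigned when an article is outside the top-20 list
TOP_K = 20


def rank(related, p1, p2):
    """Return the 1-based rank of p2 in the related-article list of p1."""
    top = related.get(p1, [])[:TOP_K]
    return top.index(p2) + 1 if p2 in top else NOT_RANKED


def rank_similarity(related, p1, p2):
    """Smaller of the two directional ranks."""
    return min(rank(related, p1, p2), rank(related, p2, p1))


# Hypothetical related-article lists keyed by article ID.
related = {
    "A": ["x1", "x2", "x3", "x4", "B"],  # B is 5th in A's list
    "B": ["y1", "y2", "A"],              # A is 3rd in B's list
}
print(rank_similarity(related, "A", "B"))  # 3
```
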
Number of shared author names. We utilized a resource, Author-ity, that identifies whether pairs of MEDLINE articles bearing the same author name (last name, first initial) were written by the same individual [8, 14]. However, because Author-ity does not contain the most recently published articles, when one or both articles were not listed in Author-ity we gave partial match scores depending on first name and middle initial matching, and summed the pairwise scores across all listed authors. That is, author_score(p1, p2) = Σ_i Σ_j score(name_i, name_j).

A score based on the presence of common authors (using the disambiguated author set Author-ity whenever possible). If the Author-ity database contains both publications p1 and p2, author_score = size(author_common) * 100. Otherwise,
$\mathrm{author\_score} = \mathrm{author\_score}(p_{1}, p_{2}),$
where author_score(p1, p2) is a score function based on the estimated number of disambiguated common authors.

If the articles were not listed in Author-ity, score(name_i, name_j) was computed as follows:
1. if name_i and name_j do not share a last name, return 0
2. if they share a last name and both have a first name:
  1. if the first names match, return 100
  2. name with or without hyphen/space (jean-francois vs. jean francois), return 43.9
  3. hyphenated name vs. name with hyphen and initial (jean-francois vs. jean-f), return 43.9
  4. hyphenated name with initial vs. name (jean-f vs. jean), return 43.9
  5. hyphenated name vs. first name only (jean-francois vs. jean), return 2.7
  6. nickname match (dave vs. david), return 0.56
  7. one edit distance (deletion: bjoern vs. bjorn; replacement: bjoern vs. bjaern; or flipped order of two characters: bjoern vs. bjeorn), return 0.21
  8. name matches first part of other name and length > 2 (zak vs. zakaria), return 0.14
  9. name matches first part of other name and length = 2 (th vs. thomas), return 0.34
  10. 3-letter initials match (e.g., jean francois g vs. jfg), return 0.13
3. if they do not both have a first name:
  1. if both first initials are available and are the same, result += 0.84
  2. if they share the same second initial, result += 4.92
  3. if both have a second initial and these differ, result −= 4.16
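The scoring rules above, together with the double sum author_score(p1, p2) = Σ_i Σ_j score(name_i, name_j), can be sketched in Python as follows. This is a partial illustration under stated assumptions, not the authors' implementation: the `Name` class is hypothetical, names are assumed to be lowercased already, a first name longer than one character counts as a full first name, and only rules 1, 2.1, 2.2 and 3 are implemented. Rules 2.3 to 2.10 are omitted, and this sketch returns 0 for those cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Name:
    """Hypothetical author-name record (all fields lowercased)."""
    last: str
    first: Optional[str] = None           # full first name or a single initial
    second_initial: Optional[str] = None

    @property
    def has_first_name(self):
        return self.first is not None and len(self.first) > 1

    @property
    def first_initial(self):
        return self.first[0] if self.first else None


def _unhyphenate(first):
    return first.replace("-", " ")


def score(a, b):
    if a.last != b.last:                       # rule 1
        return 0.0
    if a.has_first_name and b.has_first_name:  # rule 2
        if a.first == b.first:                 # rule 2.1
            return 100.0
        if _unhyphenate(a.first) == _unhyphenate(b.first):  # rule 2.2
            return 43.9
        return 0.0  # assumption: rules 2.3-2.10 are not implemented here
    result = 0.0                               # rule 3
    if a.first_initial and a.first_initial == b.first_initial:
        result += 0.84
    if a.second_initial and b.second_initial:
        result += 4.92 if a.second_initial == b.second_initial else -4.16
    return result


def author_score(authors1, authors2):
    """Sum of pairwise name scores over all listed authors."""
    return sum(score(a, b) for a in authors1 for b in authors2)
```

For example, `score(Name("smith", "j", "a"), Name("smith", "john", "a"))` adds 0.84 for the matching first initial and 4.92 for the matching second initial.
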
Affiliation similarity. Affiliation fields were chunked by taking the text between commas (this generally separates institution, city, country, etc.). The number of shared text chunks in the affiliation fields was scored.
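A minimal Python sketch of the affiliation chunking follows. The lowercasing, whitespace trimming, removal of a trailing period, and counting of distinct chunks are assumptions made for this illustration; the source specifies only that chunks are the text between commas. The example affiliations are hypothetical.

```python
def affiliation_chunks(affiliation):
    """Split an affiliation string on commas into normalized chunks."""
    chunks = set()
    for raw in affiliation.split(","):
        # Assumption: normalize case and drop a trailing period.
        chunk = raw.strip().rstrip(".").strip().lower()
        if chunk:
            chunks.add(chunk)
    return chunks


def affiliation_similarity(aff1, aff2):
    """Number of distinct chunks shared by the two affiliation fields."""
    return len(affiliation_chunks(aff1) & affiliation_chunks(aff2))


a = "Department of Psychiatry, University of Illinois, Chicago, USA"
b = "Department of Medicine, University of Illinois, Chicago, USA."
print(affiliation_similarity(a, b))  # 3
```
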
Shared email. The number of identical email addresses in the affiliation fields was scored.
Publication type similarity. The number of shared entries in the publication type field was scored.
Support type similarity. The number of shared grant support types was scored.
Email domain. The number of shared email domains was scored (only for .com domain).
Shared country. Scored as 0 or 1 depending on whether the same country name appeared in the affiliation field.
Shared grant number. The number of shared grant numbers in the GR field was scored.
Shared substance names. The number of shared entries in the RN field was scored.
All-capitalized words in title. The names of clinical trials are commonly written in all-capitalized form. The number of shared words that are all-capitalized in the title (after stoplisting) was scored.
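Counting shared all-capitalized title words might look like the following Python sketch. The stoplist contents, the minimum length of two letters, and the restriction to the letters A to Z are assumptions of this illustration, and the example titles are hypothetical.

```python
import re

# Hypothetical stoplist; the actual stoplist is not given in the source.
STOPLIST = {"THE", "AND", "OF", "IN", "FOR"}

# Assumption: an all-capitalized word is a run of two or more letters A-Z.
ALL_CAPS = re.compile(r"\b[A-Z]{2,}\b")


def all_caps_words(text):
    """Distinct all-capitalized words in the text, after stoplisting."""
    return {w for w in ALL_CAPS.findall(text) if w not in STOPLIST}


def shared_all_caps(title1, title2):
    """Number of all-capitalized words common to both titles."""
    return len(all_caps_words(title1) & all_caps_words(title2))


print(shared_all_caps("Results of the ALLHAT trial",
                      "THE ALLHAT STUDY: design"))  # 1
```
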
All-capitalized words in abstract or CN field. The number of shared words that are all-capitalized in the abstract (after stoplisting) was scored. We excluded words that are possible abbreviations (i.e. that are listed in the ADAM database) [15, 16].

(NCT numbers are identified from the SI field, or using regular expressions from within the abstract; two articles that match on NCT numbers definitely come from the same trial. This feature was not included when evaluating the model (Table 3), but will be added when Aggregator is deployed as a working tool.)
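Extracting and comparing NCT numbers could be sketched in Python as below. The pattern relies on the ClinicalTrials.gov identifier format ("NCT" followed by eight digits); the function names, the case-insensitive matching, and the example field values are assumptions for illustration, not the regular expressions actually used.

```python
import re

# ClinicalTrials.gov identifiers: "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b", re.IGNORECASE)


def nct_numbers(si_field, abstract):
    """Collect NCT numbers from the SI field and the abstract text."""
    found = NCT_PATTERN.findall(si_field) + NCT_PATTERN.findall(abstract)
    return {n.upper() for n in found}


def share_nct_number(article1, article2):
    """True if the two (si_field, abstract) pairs have an NCT number in common."""
    return bool(nct_numbers(*article1) & nct_numbers(*article2))


# Hypothetical field values.
a1 = ("ClinicalTrials.gov/NCT00123456", "")
a2 = ("", "This trial is registered as nct00123456.")
print(share_nct_number(a1, a2))  # True
```
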