Journal of the American Medical Informatics Association (JAMIA). 2019 Apr 8;26(10):1037–1045. doi: 10.1093/jamia/ocz028

A new approach and gold standard toward author disambiguation in MEDLINE

Dina Vishnyakova 1, Raul Rodriguez-Esteban 1, Fabio Rinaldi 2,3,4
PMCID: PMC7647200  PMID: 30958542

Abstract

Objective

Author-centric analyses of fast-growing biomedical reference databases are challenging due to author ambiguity. This problem has been mainly addressed through author disambiguation using supervised machine-learning algorithms. Such algorithms, however, require adequately designed gold standards that reflect the reference database properly. In this study we used MEDLINE to build the first unbiased gold standard in a reference database and improve over the existing state of the art in author disambiguation.

Materials and Methods

Following a new corpus design method, publication pairs randomly picked from MEDLINE were evaluated by both crowdsourcing and expert curators. Because the expert curators showed higher accuracy than the crowdsourcing curators, they were tasked with creating the full corpus. The corpus was then used to explore new features, not discoverable with previously existing gold standards, that could improve state-of-the-art author disambiguation algorithms.

Results

We created a gold standard based on 1900 publication pairs that shows close similarity to MEDLINE in terms of chronological distribution and information completeness. A machine-learning algorithm that includes new features related to the ethnic origin of authors showed significant improvements over the current state of the art and demonstrates the necessity of realistic gold standards to further develop effective author disambiguation algorithms.

Discussion and Conclusion

An unbiased gold standard can give a more accurate picture of the status of author disambiguation research and help in the discovery of new features for machine learning. The principles and methods shown here can be applied to other reference databases beyond MEDLINE. The gold standard and code used for this study are available at the following repository: https://github.com/amorgani/AND/

Keywords: author name disambiguation, MEDLINE, text mining, gold standard, machine learning

INTRODUCTION

Many scientific authors share the same first name and last name (or the same first initial and last name) with other authors. As a result, a scientific publication frequently cannot be associated unequivocally with a particular author. This problem is called author ambiguity, and it is especially acute in MEDLINE because a significant part of MEDLINE queries are based on author names.1 To address this problem, the adoption of unique author identifiers, such as ORCID, has been proposed, but such identifiers are not yet in widespread use. Alternatively, the task of author disambiguation tries to resolve publication authorship automatically using machine-learning algorithms.

Author disambiguation algorithms typically use features derived from publications, such as titles, abstracts, affiliations, keywords, journal names, co-authors, references, annotated concepts, and publication years. These algorithms can be split into two types: unsupervised (eg, clustering) and supervised, the latter requiring gold standard data sets for training and testing. Supervised algorithms have been shown to be superior to unsupervised algorithms in author disambiguation. However, unsupervised algorithms dispense with the need for a gold standard and, therefore, are not biased by its composition. In addition to supervised and unsupervised algorithms, there have also been attempts to solve the author disambiguation problem with the help of auxiliary sources such as Google Scholar and ResearchGate2 or by correlating citations with web searches.3–7

Current author disambiguation algorithms report accuracies usually higher than 90%, and in some cases more than 96%.8–16 However, these results are difficult to evaluate because they have been produced using several different data sets, such as MEDLINE, DBLP, CiteSeer, or in-house data sets. Moreover, as far as we can tell, none of the author disambiguation algorithms described in the scientific literature has been trained and tested on an unbiased gold standard data set, which leads to an inexact picture of their performance. In fact, representative gold standards are not available for a real-world assessment of author disambiguation methods.

An example of an existing author disambiguation gold standard is the Arnetminer corpus,17 which is based on publications extracted from DBLP, IEEE, and ACM and contains 6730 references. One of the preliminary steps in its creation was to eliminate all articles with incomplete information. Another example is the KISTI data set,18 which was derived from DBLP and contains 37 613 publications from 6921 different authors. Publications in KISTI were not selected randomly but were based on name popularity, and 2007 was the publication year cutoff. The motivation behind the creation of this data set was to achieve higher author name diversity than in existing gold standards, such as by including more non-English names. Experiments on KISTI have shown that disambiguation performance depends on author ambiguity and not on the number of same-name author instances.18 Based on this, it should be easier to disambiguate 10 publications bearing the same last name and first initial such as “Drombovski, J” than 4 publications with a more ambiguous author name such as “Wang, Y.” Another data set, Culotta-REXA, was created by Culotta et al.11 and is based on the Penn, Rexa, and DBLP data sets. Culotta et al. seem to have dropped publications with incomplete information (eg, authors with missing affiliation).

The only existing gold standard for MEDLINE (which we will call SONG) was created by Song et al.8 in 2013. This data set consists of a limited selection of author names associated with 385 first authors and 2875 publications. Moreover, it only contains publications that include information about author affiliation and abstract. Information about affiliation (mainly contact details), however, is often missing in MEDLINE. In fact, until 2014 only the affiliation of the first author was included and full author names were not systematically included until 2002.

As an alternative to SONG, we describe here a new gold standard data set based on MEDLINE that can help provide a more accurate view of the state of the art in author disambiguation for MEDLINE. Additionally, our aim is to help guide the design of gold standards for other reference databases. Our gold standard is based on authors drawn randomly from the entire MEDLINE database, including non-first authors. It does not, unlike other gold standards, overrepresent authors with known affiliation or restrict publications to those within a range of years.19 To build this gold standard we explored both crowdsourcing and expert curation.

Recently, crowdsourcing has been successfully adopted by different biomedical communities. For instance, MacLean et al.20 describe a crowdsourcing-based approach to identify medical terms in patient-authored texts, demonstrating that crowdsourcing reduces the time required for text processing from 2 weeks to 17 hours while achieving an F score of 84%. In another example, described by Bravo et al.,21 the task of the crowdsourcing curators was to judge sentences containing a putative chemical-induced disease relation. A total of 290 curators were engaged, but only 46% of them managed to answer a set of preliminary test questions. The curation results led to an F score of 76.8%. Such examples of successful applications encouraged us to explore crowdsourcing for the creation of an author disambiguation gold standard. Thus, we tested different crowdsourcing platforms and compared them with manual curation performed by experts.

In addition to describing in detail how our gold standard was built, we demonstrate that a state-of-the-art author disambiguation algorithm,22 trained and tested on this gold standard, performs worse than the same algorithm trained and tested on the SONG gold standard, because our gold standard reflects more closely the challenge of disambiguating MEDLINE. This result shows that SONG is a gold standard that is "easier" for an algorithm to handle because of its higher data completeness and biased author selection, which do not reflect the realities of MEDLINE's data incompleteness and author name distribution. Therefore, we believe that our corpus can be used to train algorithms that will perform better on real-world data.

Moreover, the algorithm trained on our gold standard relies on different features for its predictions. Additionally, to demonstrate the importance of ethnic composition in our gold standard, as discussed by Strotmann and Zhao,23 we introduced 2 new features that improved the algorithm's results: one based on the level of name ambiguity and another based on name length. Our results stress the importance of using unbiased gold standards to identify useful new features and to assess the real-world performance of author disambiguation algorithms, particularly with databases containing limited affiliation information, such as MEDLINE. Furthermore, care should be taken that author disambiguation gold standards represent the ethnic mix present in reference databases, because names in certain cultures are known to be far more ambiguous than in others.

MATERIALS AND METHODS

We used the MEDLINE baseline of 2016 for all our analyses. Following Han, Giles, et al.,14 we created our gold standard using the namespace method, the most widely adopted approach within the author disambiguation community. A namespace is the set of all publications authored by individuals with the same last name and first initial. Thus, a publication may belong to as many namespaces as it has authors. Namespaces with only 1 publication are ignored for the purpose of author disambiguation, as the names associated with them are not considered ambiguous within the database. We normalized the orthography of author names when needed; for instance, by mapping accented characters to their non-accented equivalents.
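As an illustration of this namespace construction, the sketch below groups {author name, PubMed ID} records by normalized last name and first initial and drops singleton namespaces. The record format, field names, and normalization details are our own simplifications, not the exact implementation used in this study.

```python
import unicodedata
from collections import defaultdict

def normalize(name: str) -> str:
    """Map accented characters to non-accented equivalents and lowercase."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def build_namespaces(author_records):
    """Group (last_name, first_name, pmid) records into namespaces keyed by
    normalized last name plus first initial; singleton namespaces are dropped
    because such names are not ambiguous within the database."""
    namespaces = defaultdict(set)
    for last_name, first_name, pmid in author_records:
        key = (normalize(last_name), normalize(first_name)[:1])
        namespaces[key].add(pmid)
    return {key: pmids for key, pmids in namespaces.items() if len(pmids) > 1}

# Toy example: the two "Wang, Y" records form one namespace; the singleton is dropped.
records = [("Wang", "Ying", 111), ("Wang", "Yu", 222), ("Drombovski", "Jan", 333)]
print(build_namespaces(records))  # {('wang', 'y'): {111, 222}}
```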

Once we had created namespaces for all MEDLINE references, we selected namespaces randomly (with replacement), with the probability of a namespace being selected based on its size, so that larger namespaces were more likely to be selected. Within each selected namespace we then randomly picked a pair of publications. Unlike in SONG, we considered all authors of a paper relevant, not only the first author. Specific rules were applied to filter out pairs that very likely belonged to the same author, for instance, pairs with the same first name and the same affiliation or email. The remaining publication pairs were then sent to curators, whose task was to determine whether the authors in the publication pairs were the same person or not. Each randomly selected publication pair was evaluated ("judged") by 3 curators. This method ensures that the publication pairs reviewed by curators represent a random sample of MEDLINE authors (see Table 1).

Table 1.

Stepwise description of our corpus creation method

Step 1 Extract all {author name, PubMed ID} pairs in MEDLINE, including non-first authors
Step 2 Group all {author name, PubMed ID} pairs based on author last name and first initial
Step 3 Pick at random a group (ie, namespace) based on group size
Step 4 Pick at random an {author name, PubMed ID} pair from the namespace
Step 5 Pick at random another {author name, PubMed ID} pair from the same namespace
Step 6 Go back to Step 3 until enough samples for the corpus are generated
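A minimal sketch of the sampling procedure in Table 1, assuming namespaces built as above and a selection probability proportional to namespace size (the article states only that selection was based on size). The filtering rules for near-certain same-author pairs are omitted, and all names are illustrative.

```python
import random

def sample_publication_pairs(namespaces, n_pairs, seed=0):
    """Draw publication pairs following Table 1: pick a namespace with
    probability weighted by its size (with replacement), then pick two
    distinct publications at random from that namespace."""
    rng = random.Random(seed)
    keys = list(namespaces)
    weights = [len(namespaces[key]) for key in keys]  # larger namespaces are more likely
    pairs = []
    while len(pairs) < n_pairs:
        key = rng.choices(keys, weights=weights, k=1)[0]
        pmid_a, pmid_b = rng.sample(sorted(namespaces[key]), 2)
        pairs.append((key, pmid_a, pmid_b))
    return pairs
```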

Curators were shown publication pairs and asked to judge whether two authors were, in fact, the same person. Our preliminary tests showed that giving the curators additional information, such as journal name, article title, or publication name, confused them and led them to rely only on the information we had provided. Thus, we decided to provide only links to the publications in PubMed along with the ambiguous author name, so that curators could determine on their own which information to use to make a judgment, with the aid of the internet if necessary. Additionally, a set of training instructions was provided to the curators, which included practical examples with explanations; for instance, if a pair of publications appears on an author's publication list in Google Scholar or ResearchGate, then both publications belong to the same person. In addition to evaluating the curators' performance, we also measured the time they spent on each judgment. Overall, we explored two types of curation: expert curation and crowdsourcing. Expert curators were asked to evaluate publication pairs in the same fashion as crowdsourcing curators. The experts were graduate students with an applied mathematics background at a European university. They were initially trained and tested on a training data set to ensure the quality of their performance.

We compared our resulting gold standard to the SONG gold standard.8 It should be noted that the article in which this gold standard is described8 does not specify the method used to generate publication pairs or how many were used for training and testing. To reproduce it, we applied a straightforward method that considered all potential pairs that could be formed between the authors in the publications listed. Unlike in our previous work,22 we made sure that the namespaces from the pairs selected for training and for testing were different, to avoid data "leaks." This, not unexpectedly, led to lower performance of the author disambiguation algorithm. Overall, the SONG gold standard was based on 3345 publication pairs for testing and 54 974 for training. Publication pairs belonging to different authors represented 74.6% of all pairs in the SONG gold standard.
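To illustrate what avoiding such data "leaks" means in practice, the following sketch splits labeled publication pairs so that no namespace contributes pairs to both the training and the test set. The pair tuple layout is an assumption for illustration.

```python
import random

def split_by_namespace(pairs, test_fraction=0.2, seed=0):
    """Split (namespace_key, pmid_a, pmid_b, label) tuples so that every
    namespace appears in either the training set or the test set, never both."""
    rng = random.Random(seed)
    keys = sorted({pair[0] for pair in pairs})
    rng.shuffle(keys)
    test_keys = set(keys[: int(len(keys) * test_fraction)])
    train = [pair for pair in pairs if pair[0] not in test_keys]
    test = [pair for pair in pairs if pair[0] in test_keys]
    return train, test
```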

RESULTS

We calculated that there are 3 371 838 namespaces overall in MEDLINE (2016 baseline). Of those, 1 257 865 are namespaces with only 1 publication, which we excluded from consideration. Figure 1 shows the namespaces with the highest degree of ambiguity. All top 30 most ambiguous namespaces are of East Asian origin; in fact, of the top 200 most ambiguous namespaces, 85% are of East Asian origin. Figure 2 shows the overall distribution of namespace sizes, and therefore the level of author ambiguity that exists in MEDLINE, which follows a power law, as described by Torvik et al.10

Figure 1. Distribution of the most ambiguous namespaces in MEDLINE. Number of publications per namespace for the largest namespaces.

Figure 2. Distribution of namespace sizes in MEDLINE. The horizontal axis denotes the number of publications per namespace, and the vertical axis denotes the number of namespaces.

Our first attempt to create a gold standard involved the use of crowdsourcing. To test its feasibility we used 117 publication pairs drawn from 38 namespaces of mixed ethnicity, of which 11 were of East Asian origin. Of the 117 publication pairs, 55 belonged to different authors and 62 to the same author.

We explored 2 crowdsourcing platforms, Figure Eight (previously known as CrowdFlower) and Amazon Mechanical Turk (MTurk). The main difference between these two platforms is that Figure Eight automatically filters curators by randomly including test tasks on which they are required to meet a minimum accuracy threshold.

The Figure Eight platform requires the setup of training and test tasks to filter the curators ("contributors") who will perform the tasks. We set the accuracy threshold for the test tasks at 85%; contributors who did not meet that level of performance were excluded. We set the initial price for an accepted judgment to $0.02. Overall, 199 contributors participated in the curation task, of which only 88 passed the test tasks. Each contributor provided between 5 and 75 judgments (see Figure 3). The average time spent judging a publication pair was 5.2 minutes; the minimum was 1 minute and the maximum 26 minutes.

Figure 3. Distribution of judgments per contributor. Each bar represents a contributor and the number of judgments they submitted. Contributors who had a low trust score and submitted a significantly larger number of judgments than other contributors were likely trying to abuse the system, and their judgments were excluded.

We received a total of 351 judgments from Figure Eight contributors. After manually verifying their correctness, we calculated the accuracy of their judgments to be 70.3%. The best accuracy achieved by a contributor was 86%. However, we should note that none of the participants completed the full set of 117 pairs.

Unlike Figure Eight, MTurk does not provide a built-in test to identify the performance of its curators (“workers”). There are, instead, statistics available on the performance of each worker in past submissions. However, we have observed that workers’ performance varies widely depending on the task. We set the initial price to $0.05 per judgment. Overall, 102 workers participated in the task. We rejected 35% of the workers because they performed with less than 70% accuracy in test tasks or made judgments in less than 1 minute. Most of the latter were made in less than 5 seconds.

We received 351 judgments for our 117 publication pairs. The overall accuracy of the judgments was 65%; the best worker accuracy was 75%. As with Figure Eight, none of the workers provided judgments for the whole set of publication pairs (see Figure 4). Because crowdsourcing judgments were not reliable enough even after the test tasks, and the best accuracy achieved by a crowdsourcing curator was disappointing, we hired 3 experts to curate the same 117 pairs. We pre-selected these curators with a test task and periodically checked their output. They were provided with the same examples as the crowdsourcing curators, but each expert was required to curate the whole data set; thus, 351 judgments came from only 3 experts. The accuracy of the judgments provided by the experts was 94.9%, and one of the experts achieved an accuracy of 97%.

Figure 4. Number of judgments per contributor in MTurk. Each bar represents an MTurk contributor and the number of judgments they submitted.

We used Fleiss' kappa as a statistical measure to assess the level of agreement among the experts and obtained a kappa value of 0.97, which according to the interpretation of Landis and Koch24 represents almost perfect agreement. It should be noted that the number of classes and pairs may affect the magnitude of the value; kappa is higher when there are fewer classes. The experts also provided feedback about the time they spent curating a pair, which ranged between 4 and 34 minutes. Since the experts showed better performance than the crowdsourcing curators, we decided to curate our gold standard data set using expert curators rather than crowdsourcing. We prepared a set of 1500 randomly selected publication pairs to be evaluated by the experts for the gold standard. We then trained and validated an author disambiguation algorithm using those 1500 pairs and later requested judgments from the experts for an additional set of 400 pairs that we used as a never-before-seen final test set (FTS). The complete gold standard is, therefore, composed of 1900 pairs.
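For reference, Fleiss' kappa follows its standard definition (not restated in the article): with N publication pairs, n curators per pair (here 3), and n_ij the number of curators assigning pair i to class j (same author or different author),

```latex
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e},
\qquad
\bar{P}_e = \sum_{j} p_j^{2},
\qquad
p_j = \frac{1}{N n} \sum_{i=1}^{N} n_{ij}
```

where \bar{P} is the mean observed agreement across pairs and \bar{P}_e the agreement expected by chance.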

Out of the total of 1900 publication pairs curated by the experts, we received 5700 judgments. A majority vote was used to decide whether a publication pair belonged to the same author, which 65% of the pairs did. We measured the level of curator agreement by applying Fleiss' kappa and obtained a kappa value of 0.96. We also asked the curators to comment on whether they based their judgments on information from the well-known publication directories ResearchGate and Google Scholar. According to the curators, 19% of their judgments were based on information from ResearchGate and 3% from Google Scholar. The rest of the judgments were based either on other author publication lists published on the internet or on information found in social or professional networks (eg, LinkedIn). For further validation, we emailed the authors associated with the 265 email addresses available in our gold standard. We received 65 responses, all but 1 of which agreed with our experts' consensus judgments (98% agreement).

We proceeded to compare our gold standard to SONG. As can be seen in Table 2, the completeness of the data differs greatly between the two. It is important to note that greater lack of information (eg, missing email, affiliation) is a desirable quality of a training corpus in our case, because it implies greater variability and closeness to the original data, and it offers the opportunity for a more challenging disambiguation task.

Table 2.

Completeness statistics of our gold standard and the SONG gold standard

Data set             Email    Affiliation    Full first name    First author
Our gold standard    7.9%     35.9%          35.4%              29.6%
SONG                 57.0%    95.0%          64.9%              100.0%

To show that our gold standard is representative of MEDLINE, we analyzed the distribution of publications by year (see Figure 5). Our results show a better alignment of our gold standard with the MEDLINE data set, particularly in comparison to SONG, which was created with a focus on the most productive authors with known affiliation.

Figure 5. Distribution of publications in MEDLINE, SONG, and our gold standard (GS) per year. The vertical axis represents the percentage of publications, and the horizontal axis represents years.

We then proceeded to evaluate the performance of state-of-the-art author disambiguation algorithms on our gold standard. Following our previous work,22 we deployed the C4.5 decision tree algorithm and trained it on features extracted from the publication pairs. The following features were extracted: author first and last names, initials, affiliation, type of organization (university, hospital, research center, etc.), publication year, email, location, co-authors, journal descriptors, and semantic types.
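The exact feature encodings and C4.5 implementation are not reproduced here. As a hedged sketch, the snippet below trains an entropy-based decision tree (scikit-learn's DecisionTreeClassifier, used here as a stand-in for C4.5, which scikit-learn does not provide) on pairwise comparison features of the kind listed above; the feature layout and values are purely illustrative.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row holds pairwise comparison features for one publication pair, eg
# [first_name_match, affiliation_similarity, shared_coauthors, year_difference,
#  journal_descriptor_similarity, semantic_type_similarity]
X_train = [
    [1, 0.9, 3, 1, 0.8, 0.7],   # likely the same author
    [0, 0.1, 0, 12, 0.2, 0.1],  # likely different authors
    [1, 0.0, 1, 2, 0.6, 0.5],
    [0, 0.0, 0, 20, 0.1, 0.3],
]
y_train = [1, 0, 1, 0]  # 1 = same author, 0 = different authors

# The entropy criterion approximates C4.5's information-gain-based splitting.
model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X_train, y_train)
print(model.predict([[1, 0.8, 2, 3, 0.7, 0.6]]))
```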

Because our expert curators reported more difficulty judging pairs from East Asian namespaces, which comprise the largest namespaces, we created a new feature that we called the ambiguity score: the ratio between the number of publications in a given namespace and the total number of publications in the complete data set (in our case, MEDLINE). Moreover, we included a new feature representing the length of the last name, which we regarded as a simple proxy for ethnic origin because East Asian last names (in particular Korean and Chinese) tend to be short.
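Both new features can be computed directly from the namespace statistics. A minimal sketch, with function names and example figures of our own choosing:

```python
def ambiguity_score(namespace_size: int, total_publications: int) -> float:
    """Ratio between the number of publications in the author's namespace and
    the total number of publications in the data set (here, MEDLINE)."""
    return namespace_size / total_publications

def last_name_length(last_name: str) -> int:
    """Length of the last name, a rough proxy for ethnic origin, since East
    Asian last names tend to be short."""
    return len(last_name)

# Illustrative numbers only, not figures from the article.
print(ambiguity_score(28_000, 23_000_000))              # a highly ambiguous namespace
print(last_name_length("Wang"), last_name_length("Drombovski"))
```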

Table 3 shows the results achieved on our never-before-seen FTS using a model trained on the rest of our gold standard and a model trained on SONG. These two models were built using the ambiguity score and last name length features. The model built on our data outperforms the one built on SONG across precision, recall, and F score.

Table 3.

Comparison of two models on the never-before-seen FTS of 400 pairs

             Model trained on SONG    Model trained on our gold standard
Precision    0.623                    0.827
Recall       0.583                    0.922
F score      0.602                    0.872

We also compared how a model would fare when trained on SONG and tested on our gold standard, and vice versa. This comparison involved 600 randomly selected pairs from each corpus, 300 of them positive and 300 of them negative. The results of this experiment are shown in Table 4. Training on our corpus slightly outperforms training on SONG.

Table 4.

Comparison of results achieved across data sets on a limited slice of data (600 pairs, 300 of which were positive and 300 were negative)

             Trained on SONG,               Trained on our gold standard,
             tested on our gold standard    tested on SONG
Precision    0.459                          0.731
Recall       0.500                          0.387
F score      0.478                          0.505

In order to evaluate the value of the old features, as well as the value of the new ones, we trained and tested our author disambiguation algorithm on our gold standard and on SONG and evaluated the gain ratio of each feature with respect to the class (see Table 5). The results of this evaluation showed that feature value differs greatly depending on the gold standard involved. Among the top 5 features by gain ratio, only 1 was shared by both gold standards, namely first name. The most relevant feature for SONG, affiliation, had minimal importance for our gold standard. Semantic types and journal descriptors (see our previous work22) were also much more prominent for our gold standard (10th and 11th rank for SONG, 1st and 2nd for ours). Our new feature, ambiguity score, proved to be more important in our gold standard than in SONG (4th rank compared with 7th), and the value of our other new feature, last name length, was also higher for our gold standard than for SONG. The semantic types and journal descriptors features are particularly useful when affiliation information is lacking, and the ambiguity score is related to ethnicity.

Table 5.

Results of gain ratio attribute evaluation on SONG and our gold standard ranked by feature merit. The largest changes in the ranking between our gold standard and SONG correspond to the features Journal descriptors, Semantic types, and Last name length

SONG                                   Rank    Our gold standard
Feature name           Average merit           Feature name           Average merit
Affiliation            0.24784          1      Journal descriptors    0.24454
First name             0.19890          2      Semantic types         0.19500
Type of organization   0.18686          3      Co-authors             0.18682
Country                0.04244          4      Ambiguity score        0.16262
City                   0.03887          5      First name             0.15089
Co-authors             0.02825          6      Last name length       0.11874
Ambiguity score        0.02644          7      Years' difference      0.07097
Years' difference      0.02635          8      City                   0.01636
Email                  0.02385          9      Type of organization   0.01434
Journal descriptors    0.01446         10      Language               0.00965
Semantic types         0.00892         11      Country                0.00686
Last name length       0.00639         12      Initials               0.00468
Initials               0.00014         13      Affiliation            0.00000
Language               0.00006         14      Email                  0.00000
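The ranking in Table 5 is based on the gain ratio. Assuming the standard C4.5-style definition (not restated in the article), for a feature A and the class label C:

```latex
\mathrm{GainRatio}(A) \;=\; \frac{H(C) - H(C \mid A)}{H(A)}
```

where H denotes Shannon entropy, H(C | A) is the conditional entropy of the class given the feature, and H(A) is the split information of the feature; the numerator is the information gain.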

The results of 10-fold cross-validation performed on SONG and on our gold standard are presented in Table 6. This table also shows the effect of including the two additional features, ambiguity score and last name length, and demonstrates that these features improve the performance metrics. While precision was highest on SONG with the features last name length and ambiguity score, this has to be understood in the context of a lower recall; because there is a trade-off between these two quantities, the best way to compare performance is to consider a single summary metric such as the F score.

Table 6.

Evaluation of C4.5 models using 10-fold cross-validation. Results are shown with and without our newly proposed features, ambiguity score and last name length

Trained on SONG
             Baseline    +Last name length    +Ambiguity score    +Last name length +ambiguity score
Precision    0.866       0.923                0.900               0.946
Recall       0.824       0.815                0.830               0.853
F score      0.845       0.866                0.864               0.897

Trained on our gold standard (1500 pairs)
             Baseline    +Last name length    +Ambiguity score    +Last name length +ambiguity score
Precision    0.874       0.891                0.911               0.901
Recall       0.891       0.922                0.907               0.923
F score      0.882       0.906                0.909               0.912
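The F score reported throughout is the harmonic mean of precision and recall, consistent with the values in Tables 3, 4, and 6; for example, for the SONG-trained model with both new features:

```latex
F \;=\; \frac{2PR}{P + R} \;=\; \frac{2 \times 0.946 \times 0.853}{0.946 + 0.853} \;\approx\; 0.897
```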

DISCUSSION AND CONCLUSION

This study describes a new gold standard data set for disambiguating authors in MEDLINE. Our work stems from our analysis of the existing SONG gold standard, which we realized focuses on first authors and on publications with affiliation information. Even though most recent publications in MEDLINE include some affiliation information, there are still publications with no or incomplete affiliation. Moreover, the choice of authors made in SONG was not necessarily a good reflection of the authors present in MEDLINE. Looking at the distribution of publications over time, we additionally show that SONG overrepresents more recent publications.

We designed our corpus to avoid any potential biases in order to support a realistic picture of what can be achieved using the state of the art in author disambiguation for the whole of MEDLINE. Methods that focus on specific countries and/or author types bear the risk of presenting an overly optimistic picture of the current status of author disambiguation.25 An unbiased author disambiguation gold standard should reflect the level of data completeness in the database, including affiliation information, as well as the ethnic and geographic variation of authors. The method we chose ensures that an appropriate mix of authors and publications is present.

To build the corpus we first conducted experiments with crowdsourcing platforms. The disappointing performance of curators from two well-known crowdsourcing platforms led us to conclude that crowd-based solutions are not suitable for this kind of task. Our results revealed that most of the crowdsourcing curators tried to guess the answer rather than spend time searching for information. Therefore, we could not rely on such data.

Because we noted that the most ambiguous names in MEDLINE are of East Asian origin, we introduced two new features, the ambiguity score and last name length, which led to a considerable improvement of our algorithm. Moreover, feature importance analysis showed that the disambiguation of publications with incomplete information (eg, missing affiliation) depends on other features. Indeed, when affiliation information is present, the author disambiguation problem becomes less challenging, but, because authors can change affiliation, additional features are still helpful. This can be seen in the comparison of the results on our corpus versus SONG, in which most authors have affiliation information.

The large difference in performance on the FTS between a model trained on our corpus and a model trained on SONG is evidence that gold standards that are not similar enough to their reference database do not provide a good benchmark for author disambiguation. It also justifies the design and creation of representative gold standards that can yield an accurate portrait of the current state of the art in author disambiguation.

Although the process to create our gold standard was carefully planned, we are aware of some of its limitations and shortcomings. First, our gold standard is based on the namespace method. This method carries the risk of wrongly classifying authors whose names are erroneously entered in the database, such as those with a first name mistakenly recorded as a last name, or a first name with an error in the first letter. Second, our method does not allow for the disambiguation of authors whose name has changed (eg, after marriage). Third, the size of the corpus is limited to 3800 publications (including 800 publications from the FTS). This size is smaller than that of the SONG corpus and might not be optimal for achieving the best performance of author disambiguation algorithms. Fourth, we designed our gold standard using publication pairs in order to treat author disambiguation as a binary classification problem. This is not ideal for clustering methods, since they may require more publications belonging to the same authors. Fifth, while there is no better validation method than contacting the authors themselves, the responses that we received cannot be considered a random sample, as they came from authors with a listed email address who were willing to respond to our request. There is likely a certain error rate in creating gold standards of this kind that is very difficult to avoid, because some cases are inherently very difficult to curate (eg, lack of affiliation, transliterated Asian-origin names, or publications with non-English abstracts). Ultimately, judgments are based on decisions made by human experts.

Even though the focus of this work was on the creation of a gold standard set for author disambiguation in MEDLINE, the principles presented in this article can be applied to other reference databases. We believe that the continuous improvement of author disambiguation algorithms will enable the advancement of author-centered, large-scale, scientific analyses.26,27

FUNDING

The first author was funded by a Roche Postdoctoral Fellowship.

AUTHOR CONTRIBUTIONS

DV, FR, and RR were involved in conceptualization, design, and analysis; DV in implementation and data collection. All contributed to manuscript drafting and revisions.

ACKNOWLEDGMENTS

We would like to thank Khan Ozol for his contribution in getting this project started.

CONFLICT OF INTEREST STATEMENT

None declared.

REFERENCES

1. Islamaj Dogan R, Murray GC, Névéol A, et al. Understanding PubMed user search behavior through log analysis. Database 2009; 2009: bap018.
2. Abdulhayoglu MA, Thijs B. Use of ResearchGate and Google CSE for author name disambiguation. Scientometrics 2017; 111(3): 1965–85.
3. Yang K-H, Peng H-T, Jiang J-Y, et al. Author name disambiguation for citations using topic and web correlation. In: International Conference on Theory and Practice of Digital Libraries. Aarhus, Denmark: Springer; 2008.
4. Tan YF, Kan MY, Lee D. Search engine driven author disambiguation. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. Chapel Hill, NC: ACM; 2006.
5. Yang K-H, Jiang J-Y, Lee H-M, et al. Extracting Citation Relationships from Web Documents for Author Disambiguation. Technical Report TR-IIS-06-017. Taipei, Taiwan: Institute of Information Science, Academia Sinica; 2006.
6. Kanani P, McCallum A. Efficient strategies for improving partitioning-based author coreference by incorporating Web pages as graph nodes. In: Proceedings of the AAAI 2007 Workshop on Information Integration on the Web; 2007.
7. Lu Y, Nie Z, Cheng T, et al. Name disambiguation using web connection. In: Proceedings of the AAAI 2007 Workshop on Information Integration on the Web; 2007.
8. Song M, Kim EH-J, Kim HJ. Exploring author name disambiguation on PubMed-scale. J Informetrics 2015; 9(4): 924–41.
9. Han H, Xu W, Zha H, et al. A hierarchical naive Bayes mixture model for name disambiguation in author citations. In: Proceedings of the 2005 ACM Symposium on Applied Computing; 2005: 1065–9.
10. Torvik VI, Smalheiser NR. Author name disambiguation in MEDLINE. ACM Trans Knowl Discov Data 2009; 3(3): 11.
11. Culotta A, Kanani P, Hall R, et al. Author disambiguation using error-driven machine learning with a ranking loss function. In: Sixth International Workshop on Information Integration on the Web (IIWeb-07); 2007.
12. Ferreira AA, Veloso A, Gonçalves MA, et al. Effective self-training author name disambiguation in scholarly digital libraries. In: Proceedings of the 10th Annual Joint Conference on Digital Libraries; 2010.
13. Huang J, Ertekin S, Giles CL. Efficient name disambiguation for large-scale databases. In: European Conference on Principles of Data Mining and Knowledge Discovery; 2006.
14. Han H, Giles L, Zha H, et al. Two supervised learning approaches for name disambiguation in author citations. In: Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries; 2004.
15. Treeratpituk P, Giles CL. Disambiguating authors in academic publications using random forests. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries; 2009.
16. Liu W, Islamaj Doğan R, Kim S, et al. Author name disambiguation for PubMed. J Assoc Inf Sci Technol 2014; 65(4): 765–81.
17. Tang J, Fong AC, Wang B, et al. A unified probabilistic framework for name disambiguation in digital library. IEEE Trans Knowl Data Eng 2012; 24(6): 975–87.
18. Kang I-S, Kim P, Lee S, et al. Construction of a large-scale test set for author disambiguation. Inf Process Manage 2011; 47(3): 452–65.
19. Zhao D, Strotmann A. Counting first, last, or all authors in citation analysis: a comprehensive comparison in the highly collaborative stem cell research field. J Am Soc Inf Sci 2011; 62(4): 654–76.
20. MacLean DL, Heer J. Identifying medical terms in patient-authored text: a crowdsourcing-based approach. J Am Med Inform Assoc 2013; 20(6): 1120–7.
21. Bravo À, Li TS, Su AI, et al. Combining machine learning, crowdsourcing and expert knowledge to detect chemical-induced diseases in text. Database 2016; 2016: baw094.
22. Vishnyakova D, Rodriguez-Esteban R, Ozol K, et al. Author name disambiguation in MEDLINE based on journal descriptors and semantic types. In: Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016); 2016.
23. Strotmann A, Zhao D. Author name disambiguation: what difference does it make in author-based citation analysis? J Am Soc Inf Sci Tec 2012; 63(9): 1820–33.
24. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Measur 1973; 33(3): 613–9.
25. Lerchenmueller MJ, Sorenson O. Author disambiguation in PubMed: evidence on the precision and recall of Author-ity among NIH-funded scientists. PLoS One 2016; 11(7): e0158731.
26. Cokol M, Rodriguez-Esteban R. Visualizing evolution and impact of biomedical fields. J Biomed Inform 2008; 41(6): 1050–2.
27. Rodriguez-Esteban R, Loging WT. Quantifying the complexity of medical research. Bioinformatics 2013; 29(22): 2918–24.
