AMIA Annual Symposium Proceedings. 2020 Mar 4;2019:655–663.

Identification and Ranking of Biomedical Informatics Researcher Citation Statistics through a Google Scholar Scraper

Allison B McCoy, Dean F Sittig, Jimmy Lin, Adam Wright
PMCID: PMC7153158  PMID: 32308860

Abstract

To overcome limitations of previously developed scientific productivity ranking services, we created the Biomedical Informatics Researchers ranking website (rank.informatics-review.com). The website is composed of four key components that work together to create the automatically updating ranking website: 1) list of biomedical informatics researchers, 2) Google Scholar scraper, 3) display page, and 4) updater. The interactive website has facilitated identification of leaders in each of the key citation statistics categories (i.e., number of citations, h-index, and i10-index), and it has allowed other groups, such as tenure and promotions committees, to more effectively and efficiently evaluate researchers and interpret the various citation statistics reported by candidates. Creation of the biomedical informatics researcher ranking website highlights the vast differences in scholarly productivity among members of the biomedical informatics research community. Future efforts are underway to add new functionality to the website and to expand the work to identify top papers in biomedical informatics.

Introduction

Citation statistics have been used to measure scholarly production of researchers since 1964, when Eugene Garfield created the Science Citation Index.1 Since then, additional measures have been developed in an attempt to quantify research productivity and scientific impact independent of a researcher’s field of study and years at work, and without inflation from a small number of highly cited articles. Some services have attempted to rank research productivity, including ResearchGate,2 SciVal,3 and Highly Cited Researchers.4 However, these services have limitations, such as reliance on proprietary metrics, inclusion of only a limited number of highly ranked researchers, and requirements of organizational commitment to commercial sites.

Google created a new paradigm for citation analysis with Google Scholar,5 an online, freely available, automatically updating scientific information resource. By allowing researchers to create their own profile page,6 complete with multiple bibliometric calculations (e.g., total citations, h-index, i10-index), Google provides a potential new method that allows comparison across researchers. Comparisons of Google Scholar, Web of Science, and Scopus have found variations in citation statistics: Google Scholar’s counts were frequently found to be considerably higher than those of Web of Science and Scopus because of Google Scholar’s wider inclusion of conference papers and gray literature.7,8 Despite these limitations, the format and availability of Google Scholar Profiles prove advantageous.

To facilitate comparison across researchers, we created a ranking website based on Google Scholar profiles for biomedical informatics researchers.9 This site was based on code previously developed by one author [JL] and used to create ranking sites for information retrieval,10 human-computer interaction,11 and top computer science researchers12 using a straightforward website scraping approach. These sites computed additional metrics to normalize Google Scholar’s bibliometric measures by dividing each of them by the number of years since the researcher received his or her first citation (i.e., citations/year, h-index/year, i10-index/year).13

Design Considerations

Our goal in creating the Biomedical Informatics Researchers ranking website was to produce a freely available, automatically updating information resource based on Google Scholar citation profiles for all individuals interested in the field of biomedical informatics. Creating this resource required us to:

  • Identify a set of biomedical informatics researchers with publicly-available Google Scholar profiles.

  • Develop efficient methods to scrape the Google Scholar citation profiles of this list of individuals and extract key citation metrics.

  • Implement a method to render the ranked list of researchers along with the means to re-order and search the list.

  • Produce a method to allow new biomedical informatics researchers to add their name and Google Scholar profile location to the ranking site.

  • Develop a method to automatically update the ranking site with the latest Google Scholar results on a periodic basis.

System Description

A snapshot of the ranking website as of March 11, 2019 is depicted in Figure 1. The website is composed of four key components that all work together to create the automatically updating ranking site: list of researchers, Google Scholar scraper, display page, and updater.

Figure 1: Top 15 biomedical informatics researchers as of March 11, 2019

The first is the list of biomedical informatics researchers. The list file is in JSON (JavaScript Object Notation) format, with name / URL pairs represented as:

“Full Name [FACMI] [FAMIA] [FIAHSI] [Collen Year]”: “Google Scholar URL”

where:

  • “FACMI” is an optional indicator that designates that they are a member of the American College of Medical Informatics (ACMI).14

  • “FAMIA” is an optional indicator that designates that they are a fellow of the American Medical Informatics Association (AMIA).15

  • “FIAHSI” is an optional indicator that designates that they are a member of the International Academy of Health Sciences Informatics (IAHSI).16

  • “Collen Year” is an optional indicator that designates that they are a recipient of the Morris F. Collen Award, and the year of their award.17
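
As an illustration of how a list in this format might be consumed, the following Python sketch parses name / URL pairs and separates the optional indicators from the name. The token spellings, the `Collen <year>` form, and the sample entry (a made-up name and an example.org URL) are assumptions for illustration, not the contents of the actual list file.

```python
import json
import re

# Indicator tokens named in the text; their exact spelling in the real
# list file is an assumption of this sketch.
DESIGNATIONS = ("FACMI", "FAMIA", "FIAHSI")
YEAR = re.compile(r"(19|20)\d{2}")


def parse_entry(key: str) -> dict:
    """Split a display-name key into the bare name and optional indicators."""
    tokens = key.split()
    flags = {d: False for d in DESIGNATIONS}
    collen_year = None
    # Peel optional indicators off the end of the key, right to left.
    while tokens:
        last = tokens[-1]
        if last in DESIGNATIONS:
            flags[last] = True
            tokens.pop()
        elif YEAR.fullmatch(last) and len(tokens) >= 2 and tokens[-2] == "Collen":
            collen_year = int(last)
            tokens = tokens[:-2]
        else:
            break
    return {"name": " ".join(tokens), **flags, "collen_year": collen_year}


def load_researchers(json_text: str) -> list:
    """Turn the JSON object of name / URL pairs into a list of records."""
    entries = json.loads(json_text)
    return [{**parse_entry(key), "url": url} for key, url in entries.items()]


# Hypothetical entry: not a real researcher or profile URL.
SAMPLE = '{"Jane Q. Example FACMI Collen 2015": "https://example.org/profile"}'
print(load_researchers(SAMPLE))
```

Keeping the indicators inside the display name means the list file stays a flat set of pairs; the cost is that any consumer wanting structured fields has to parse them back out, as above.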

The FACMI, FAMIA, FIAHSI, and Collen Year components are appended to the name to be displayed on the ranked list of researchers. The list was created initially through an iterative process that began with manual searches for known biomedical informaticians using Google Scholar. After approximately 100 researchers were identified, we realized that we needed an automated method to develop a more comprehensive list. Therefore, in 2014, we used the “label:biomedical_informatics” search feature that identified all individuals on Google Scholar with “biomedical informatics” as one of their “areas of interest” and at least one publication with one or more citations. We repeated this search using “label:medical_informatics” and other common key words related to biomedical informatics, including health informatics, electronic health record, clinical decision support, and health information technology.

To facilitate new requests to be added to the list of biomedical informatics researchers, we created a Google form that allows researchers to request that their profiles be added to the ranking website. The form prompts researchers to enter their name and Google Scholar URL; to indicate whether they are an ACMI fellow, AMIA fellow, or member of IAHSI; and to indicate whether they have received the Morris F. Collen award and, if so, the year. One of the co-authors manually verifies the submitted data to prevent errors when the scraper runs; the entry is then added to the list of biomedical informatics researchers and appears on the site with the next update. Since creating the form, we have periodically solicited requests for individuals to add their profiles (or create one if they had not already done so) through the ACMI listserv and other targeted e-mailings (e.g., to department listservs, through AMIA Connect). We have also manually added new profiles found through repeated Google Scholar searches of relevant labels and review of new individuals listed on ACMI, FAMIA, and IAHSI member lists.

The second component is the Google Scholar Scraper. This open-source application is written in node.js and built using commonly available open-source libraries. It takes as input the list of researchers and then iteratively retrieves each person’s Google Scholar citation counts: the total number of citations, the year of first citation, the i10-index, and the h-index. These values are extracted by matching the relevant elements from each page’s DOM (Document Object Model) structure. This approach makes the scraper dependent on the layout of the Google Scholar profile page, so it is not robust to layout changes; indeed, the application has broken several times since its initial development in 2014 after Google updated its site. However, no APIs (Application Programming Interfaces) that allow programmatic access to such data are available, so there are few alternatives to this screen-scraping approach.
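
The DOM-matching idea, and its fragility, can be sketched in a few lines. The example below is written in Python with the standard-library `html.parser` rather than node.js, and the markup, class names, and field labels are entirely hypothetical; they do not reproduce the actual Google Scholar profile layout or the authors' scraper.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a profile page's statistics table.
SAMPLE_HTML = """
<table id="stats">
  <tr><td class="label">Citations</td><td class="value">1,028</td></tr>
  <tr><td class="label">h-index</td><td class="value">15</td></tr>
  <tr><td class="label">i10-index</td><td class="value">20</td></tr>
</table>
"""

EXPECTED = ("Citations", "h-index", "i10-index")


class StatsParser(HTMLParser):
    """Pair each 'label' cell with the 'value' cell that follows it."""

    def __init__(self):
        super().__init__()
        self._cell_class = None  # class of the <td> currently being read
        self._label = None       # most recent label awaiting its value
        self.stats = {}

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._cell_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "td":
            self._cell_class = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._cell_class == "label":
            self._label = text
        elif self._cell_class == "value" and self._label is not None:
            self.stats[self._label] = int(text.replace(",", ""))
            self._label = None


def extract_stats(html: str) -> dict:
    """Return the expected statistics, or fail loudly if the layout moved."""
    parser = StatsParser()
    parser.feed(html)
    missing = [label for label in EXPECTED if label not in parser.stats]
    if missing:
        raise ValueError(f"layout changed? missing fields: {missing}")
    return parser.stats


print(extract_stats(SAMPLE_HTML))
```

Renaming a single class in the sample markup is enough to make `extract_stats` raise, which is the same failure mode described above when the page layout changes.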

In addition to extracting raw statistics from profile pages, the application also calculates the citations/year, i10-index/year, and h-index/year; all computed values are written into a file in JSON format, which facilitates the display as well as downstream processing by other applications. The following is a brief definition of each of the bibliometric measures included on the ranking site:

  • Total number of citations – the total number of citations to all of a researcher’s published articles

  • Year of first citation – the year in which the researcher received his or her first citation, regardless of the year of publication of their first article

  • i10-index – the number of articles that a researcher has published that have received at least 10 citations

  • h-index – the largest number h such that the researcher has published h articles that have each received at least h citations.18 In other words, if a researcher has published 25 articles that have each received at least 25 citations, and does not have 26 articles with at least 26 citations each, then his or her h-index is 25.

  • Citations/year – a researcher’s total number of citations divided by the number of years in which he or she has been accumulating citations (i.e., current year – year of first citation)

  • i10-index/year – a researcher’s i10-index divided by the number of years in which he or she has been accumulating citations

  • h-index/year – a researcher’s h-index divided by the number of years in which he or she has been accumulating citations
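
The definitions above translate directly into code. The following Python sketch computes them from a list of per-article citation counts; the function names are illustrative, and returning `None` when the first citation falls in the current year is an assumption of this sketch, since the text does not say how that case is handled.

```python
def h_index(citations):
    """Largest h such that h articles have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of articles with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


def per_year(value, first_citation_year, current_year):
    """Divide a metric by (current year - year of first citation)."""
    years = current_year - first_citation_year
    if years <= 0:
        return None  # assumption: undefined before a full year has elapsed
    return value / years


# The example from the text: 25 articles with at least 25 citations each.
counts = [25] * 25
print(h_index(counts), i10_index(counts))   # prints: 25 25
print(per_year(h_index(counts), 2009, 2019))  # prints: 2.5
```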

The third component is the display page, which renders the JSON data created by the scraper in HTML/CSS (Hypertext Markup Language / Cascading Style Sheets) with the aid of JavaScript. The display lists the researchers in ranked order and allows a user to re-sort the entire list by any of the column headers (e.g., citations or i10-index). The display page also incorporates a search feature that allows one to display a ranked subset of researchers, for example: 235 ACMI members, the 32 people associated with “Vanderbilt University”, or the 26 people with “David” in either their name or affiliation. This page also includes code that allows Google Analytics to track website traffic.
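
The re-sort and search behaviour amounts to a filter followed by a sort. A minimal Python sketch follows; the actual page does this in JavaScript, and the record fields and sample entries here are hypothetical.

```python
def ranked_view(researchers, sort_by="h_index", query=""):
    """Keep records whose name or affiliation contains the query
    (case-insensitive), then order them by the chosen column, highest first."""
    needle = query.strip().lower()
    matches = [
        r for r in researchers
        if needle in r["name"].lower() or needle in r["affiliation"].lower()
    ]
    return sorted(matches, key=lambda r: r[sort_by], reverse=True)


# Hypothetical records, not taken from the ranking site.
RESEARCHERS = [
    {"name": "David Example", "affiliation": "Example University",
     "citations": 900, "h_index": 12},
    {"name": "Ana Sample", "affiliation": "Davidson Institute",
     "citations": 5000, "h_index": 30},
    {"name": "Lee Placeholder", "affiliation": "Example University",
     "citations": 2000, "h_index": 20},
]

for row in ranked_view(RESEARCHERS, sort_by="citations", query="david"):
    print(row["name"], row["citations"])
```

Searching name and affiliation together explains why a query such as “David” can return people who match on either field.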

The final component is the updater, a script that periodically re-runs the scraper and pushes the updated data to the website and GitHub. The updater is currently set to run twice per week. The updater script periodically encounters and detects several error conditions, including instances where people delete their Google Scholar profile or make it private, network issues that prevent connection to Google Scholar or GitHub, and temporary blocks imposed by Google. Although Google permits scraping of Google Scholar profiles in its robot exclusion standard (robots.txt) file, it does periodically block the scraper if it is set to run too often.
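
One way to structure such an update run is to pause between requests and record each error condition separately instead of aborting. The Python sketch below is an assumption-laden illustration: the exception classes, the `fetch` callable, and the five-second delay are hypothetical and are not taken from the authors' updater.

```python
import time


class ProfileUnavailable(Exception):
    """Hypothetical: the profile was deleted or made private."""


class TemporarilyBlocked(Exception):
    """Hypothetical: the remote site is refusing requests for now."""


def run_update(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each profile in turn, pausing between requests.

    `fetch` is any callable that returns a statistics record for a URL or
    raises one of the exceptions above (or OSError for network failures).
    Returns (results, errors), both keyed by URL.
    """
    results, errors = {}, {}
    for index, url in enumerate(urls):
        if index:
            sleep(delay_seconds)  # spacing out requests reduces blocking risk
        try:
            results[url] = fetch(url)
        except ProfileUnavailable as exc:
            errors[url] = f"unavailable: {exc}"
        except TemporarilyBlocked:
            errors[url] = "blocked"
            break  # stop here; remaining profiles wait for the next run
        except OSError as exc:
            errors[url] = f"network: {exc}"
    return results, errors
```

Passing `fetch` and `sleep` in as parameters keeps the loop testable without touching the network, and stopping at the first block avoids prolonging one.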

Current Status of Biomedical Informatics Researcher Ranking Website

The Citation Statistics of Biomedical Informatics Researchers ranking website can be viewed at rank.informatics-review.com. In the nearly five years since its inception, the website has been viewed more than 18,000 times by almost 9,000 users. Of these users, 70% reside in the United States, 6% in India, 2.5% in Australia, 2.4% in Canada, 1.7% in the United Kingdom, and 1% in China. We observed apparent spikes in website traffic in several instances after listserv e-mails were sent or individual researchers mentioned the website on social media (Figure 2). For example, timepoint #1 corresponds with an e-mail sent to the ACMI listserv, and timepoint #4 corresponds with a Tweet by @allisonbmccoy.

Figure 2: Visitors to the biomedical informatics ranking website from Google Analytics with traffic spikes corresponding to known instances of dissemination.

The list of biomedical informatics researchers contains 1,401 individuals, including 235 ACMI fellows, 62 AMIA fellows, 61 IAHSI members, and 12 Morris F. Collen award winners. Requests to be added to the site have been submitted through the Google form for 171 researchers.

Since the biomedical informatics ranking list has been available, numerous uses for the list have been identified, including:

  • To create a list of members from a single university and compare the scholarly productivity of that university’s biomedical informatics departments. To our knowledge, at least three universities are currently using the website in annual department reviews.19

  • To identify productive researchers for nomination as ACMI fellows or IAHSI members.

  • To identify potential recruits for academic positions.

  • To help tenure and promotions committees to interpret the various citation statistics reported by candidates.

  • To identify speakers for conferences.

  • To identify subfields of biomedical informatics for which citations are highest.20

Notable Statistics for Biomedical Informatics Researchers

Table 1 shows the median, minimum, and maximum values for all biomedical informatics researchers as well as for all ACMI, FAMIA, and IAHSI members and Morris F. Collen award winners, as identified through the biomedical informatics ranking website as of March 4, 2019. As expected, ACMI and IAHSI members (median h-index: 35.5 and 41, respectively) have a higher median h-index than all researchers (median h-index: 15), and their median year of first citation shows that they have been publishing for 8-10 years longer. The median h-index for AMIA fellows (14) is similar to the median for all researchers, which is also expected given that FAMIA recognition is based on application of informatics skills and knowledge, regardless of research productivity. Table 2 shows the values for 10 randomly chosen Nobel Prize winning scientists (median h-index: 120) as an upper extreme for comparison.22

Table 1: Descriptive analysis of citation statistics for biomedical informatics researchers

Metric                Statistic  All Researchers  ACMI Fellows  AMIA Fellows  IAHSI Members  Collen Award Winners
                                 (N=1,401)        (N=235)       (N=62)        (N=61)         (N=12)
Year of 1st citation  Median     2004             1996          2005          1994           1980
                      Min        1980             1980          1983          1980           1984
                      Max        2017             2009          2012          2006           1997
Total citations       Median     1,028            5,389         812.5         7,145          9,964.5
                      Min        2                300           72            247            4,456
                      Max        166,410          166,410       24,324        108,929        108,929
Citations/year        Median     68               237           55            274            302
                      Min        0                10            5             15             117
                      Max        9,958            6,400         950           4,951          4,951
h-index               Median     15               36            14            41             46
                      Min        1                9             3             7              31
                      Max        199              149           72            149            149
h-index/year          Median     1                1.6           1             1.5            1.6
                      Min        0.1              0.2           0.2           0.5            0.8
                      Max        15.6             7.7           3.3           6.8            6.8
i10-index             Median     20               77            17.5          117            127.5
                      Min        0                8             1             6              77
                      Max        922              802           252           695            695
i10-index/year        Median     1.4              3.4           1.2           4.6            4.35
                      Min        0                0.2           0.1           0.4            2
                      Max        51.3             33.9          9             31.6           31.6

ACMI = American College of Medical Informatics, AMIA = American Medical Informatics Association, IAHSI = International Academy of Health Sciences Informatics

Table 2: Convenience sample of 10 Nobel Prize winners’ Google Scholar citation statistics (as of March 11, 2019)

Name                  Nobel Prize (Year)  Citations  h-index  i10-index
Gerhard Ertl          Chemistry (2007)    71,475     132      573
Michael Levitt        Chemistry (2013)    38,232     91       181
Herbert A. Simon      Economics (1978)    338,316    172      554
Paul R. Krugman       Economics (2008)    217,039    159      862
Christopher A. Sims   Economics (2011)    57,963     77       157
Alvin E. Roth         Economics (2012)    47,211     95       214
Eugene F. Fama        Economics (2013)    266,441    107      192
Jean Tirole           Economics (2014)    128,261    134      306
Albert Einstein       Physics (1921)      241,716    201      800
Yoichiro Nambu        Physics (2008)      25,775     52       93
Median values                             99,868     120      260
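
The median row of Table 2 can be reproduced from the ten listed values. The short Python check below uses the standard-library `statistics.median`; with an even number of values it averages the two middle ones, so the h-index median comes out as 119.5, which the table reports rounded to 120.

```python
from statistics import median

# Values transcribed from Table 2, in the order the rows are listed.
citations = [71475, 38232, 338316, 217039, 57963,
             47211, 266441, 128261, 241716, 25775]
h_index = [132, 91, 172, 159, 77, 95, 107, 134, 201, 52]
i10_index = [573, 181, 554, 862, 157, 214, 192, 306, 800, 93]

medians = {
    "citations": median(citations),
    "h_index": median(h_index),
    "i10_index": median(i10_index),
}
print(medians)
# {'citations': 99868.0, 'h_index': 119.5, 'i10_index': 260.0}
```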

In reviewing the citation ranking statistics on the website (Figure 1) and changing the primary sort order by column, we have made a number of interesting observations about the list:

  • Eugene Koonin from National Center for Biotechnology Information has the most citations (166,410) and the highest h-index (199).

  • Twenty-nine researchers are tied for the earliest “year of first citation” (1980). In reviewing these results, we noted that this is a limitation in the Google Scholar profile page, with no citations depicted prior to this date, though prior publication dates are listed on individual researchers’ pages (e.g., Homer Warner, 1951).21

  • Alex Wang from National Institutes of Health has the highest i10-index (922).

  • Brian Pollack from University of Pittsburgh has the highest citations/year (9,958), h-index/year (15.6), and i10-index/year (51.3).

  • ACMI members make up 55 of the top 100 researchers when sorted by both h-index and citations, and IAHSI members make up 20/100.

To evaluate the included citation statistics, we calculated the correlation between the h-index and total citations (r²=0.77) (Figure 3) and between the h-index and i10-index (r²=0.89) (Figure 4) using Stata/IC 15.1. Overall, the statistics portray researcher productivity similarly; however, in one case a researcher has a disproportionately high total citation count relative to h-index because of a single paper with more than 100,000 citations.
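
For readers who want to run a comparable check on their own data, the Python sketch below computes the Pearson correlation coefficient and its square. It is not the Stata procedure used for the reported values, and the five data points are made-up toy values, not drawn from the ranking site.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Toy values for illustration only.
toy_h_index = [5, 10, 15, 20, 25]
toy_citations = [100, 400, 900, 1600, 2500]

r = pearson_r(toy_h_index, toy_citations)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

Because total citations grow faster than linearly with h-index in the toy data, r² falls just short of 1 even though the relationship is perfectly monotonic; a single extreme outlier, like the one described above, lowers it further.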

Figure 3: Graph showing the relationship between h-index and total citations (r²=0.77).

Figure 4: Graph showing the relationship between h-index and i10-index (r²=0.89).

Lessons Learned

Creation of the biomedical informatics researcher ranking website highlights the vast differences in scholarly productivity among members of the biomedical informatics research community. Careful inspection of the citations included on many researchers’ profile pages also highlights many of the limitations of automatically curated lists, including:

  • For individuals with relatively common family names, articles authored by other researchers are often included erroneously, which can falsely inflate citation statistics and rankings.23 Authors can curate their own profiles to remove erroneous citations, but few do.

  • Duplicate citations exist in many profiles and can also falsely inflate citation statistics and rankings;23 however, Google Scholar has implemented functionality to automatically merge some articles and combine citations when authors do not manually merge citations.

  • Researchers with publications before the 1990s, when use of the internet substantially increased, are not as well represented in the various citation statistics. Most notable is that no citations before 1980 are included in any of the counts, an important limitation of the Google Scholar profile page and the scraper tool.

  • Highly cited publications by large consortia, including the “Initial Sequencing and analysis of the human genome”24 and “Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC”25 heavily skew some authors’ citation statistics.

  • Not all included articles are equal, yet Google Scholar often lists blog posts and slide presentations alongside articles from peer-reviewed scientific journals.

  • Likewise, not all citations are equal, yet Google Scholar counts all citations equally, whether from a website, slide presentation, or top scientific journal.

  • Most indexed articles are in English, which disadvantages researchers who publish in other languages.

Future Directions

While we believe the current biomedical informatics researcher ranking site is already very useful, we are continuing to identify new researchers, especially those who are highly cited, ACMI fellows, members of IAHSI, or Morris F. Collen award winners. In addition, we are reviewing profiles with a large number of incorrect or duplicate citations and requesting that the individuals curate their profile or be removed from the list. We have also identified numerous enhancements that we hope to make in the future, including:

  • Adding the total number of articles included in each person’s Google Scholar profile and the year of first publication to the biomedical informatics researcher ranking website.

  • Adding an indicator for other noteworthy accomplishments, including AMIA signature awards (e.g., Donald A.B. Lindberg Award for Innovation in Informatics, Virginia K. Saba Informatics Award, and AMIA New Investigator Award).

  • Calculating the longest consecutive string of years in which each researcher published one or more articles that received one or more citations.26

  • Calculating hs (universal h-index), or the h-index of an individual divided by the mean h-index of everyone in the field.27

  • Evaluating and improving the usability and efficiency of the site.
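
Two of the planned measures are simple enough to sketch. The Python example below shows one possible reading of each: the universal h-index as an individual's h-index divided by the mean h-index of the field, and the longest run of consecutive years with at least one cited publication. The function names and inputs are hypothetical, and these features are described above as future work, not as existing code.

```python
from statistics import mean


def universal_h(h, field_h_values):
    """h-index divided by the mean h-index of everyone in the field."""
    return h / mean(field_h_values)


def longest_streak(years):
    """Length of the longest run of consecutive years in `years`.

    `years` should hold the publication years of articles that received
    at least one citation; duplicates are ignored.
    """
    best = run = 0
    previous = None
    for year in sorted(set(years)):
        run = run + 1 if previous is not None and year == previous + 1 else 1
        best = max(best, run)
        previous = year
    return best


print(universal_h(30, [10, 20, 30]))                           # prints: 1.5
print(longest_streak([2001, 2002, 2003, 2005, 2006, 2003]))    # prints: 3
```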

Finally, we are exploring opportunities to use the current Google Scholar scraper to identify top papers in biomedical informatics for all time and in the past year. A preliminary version of this new tool retrieved the top 100 most cited publications with 100 or more citations from all profiles in the list of biomedical informatics researchers and found 7,429 papers that met these criteria. The most cited publication had 69,812 citations in total, or 3,173 citations per year.28 A preliminary version of the tool to identify top papers in the last year retrieved all publications in 2018 from all profiles in the list of biomedical informatics researchers and found 3,751 publications. The most cited publication had 2,177 citations in total.29 At present, this new tool has several limitations. One important limitation is the inclusion of all publication types; in 2018, the most cited publication was a textbook. Another limitation is the inclusion of papers published by biomedical informatics researchers in areas that are not directly related to biomedical informatics; for example, the second most cited publication in 2018 is an American Heart Association report on which a biomedical informatics researcher played a small role related to informatics development or data analysis.30
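
A top-papers pass over many profiles needs at least a citation threshold and some way to avoid counting a co-authored paper once per profile. The Python sketch below shows one possible approach; de-duplicating by normalized title is an assumption of this sketch (the text does not describe how the preliminary tool handles duplicates), and the records are made up.

```python
def top_papers(profiles, min_citations=100, limit=100):
    """Merge papers across profiles, keep those at or above the citation
    threshold, and return the most cited ones first."""
    best = {}
    for papers in profiles:
        for paper in papers:
            # Assumption: the same title on two profiles is the same paper.
            key = " ".join(paper["title"].lower().split())
            if key not in best or paper["citations"] > best[key]["citations"]:
                best[key] = paper
    eligible = [p for p in best.values() if p["citations"] >= min_citations]
    return sorted(eligible, key=lambda p: p["citations"], reverse=True)[:limit]


def citations_per_year(paper, current_year):
    """Total citations divided by years since publication (at least one)."""
    return paper["citations"] / max(1, current_year - paper["year"])


# Hypothetical profiles: two co-authors list "Paper A" with slightly
# different citation counts and title spacing.
PROFILES = [
    [{"title": "Paper A", "year": 2010, "citations": 500},
     {"title": "Paper B", "year": 2015, "citations": 40}],
    [{"title": "paper  a", "year": 2010, "citations": 480},
     {"title": "Paper C", "year": 2012, "citations": 150}],
]

for paper in top_papers(PROFILES):
    print(paper["title"], paper["citations"])
```

Filtering by publication type or topic, the two limitations noted above, would need additional fields that this sketch does not model.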

Conclusion

We have developed an easily searchable, interactive, automatically updating, open-source bibliometric ranking website using Google Scholar citation profiles that includes over 1,300 biomedical informatics researchers from around the world. While there are limitations to both using bibliometric citation analysis to measure scientific productivity and automatically generated lists of articles and citations, the biomedical informatics ranking website has already proven to be useful for a number of important tasks. Future efforts are underway to add new functionality to the website and to expand the work to identify top papers in biomedical informatics.
