AMIA Annual Symposium Proceedings. 2011 Oct 22;2011:374–381.

Exploring the Use of Social Media to Measure Journal Article Impact

Perry Evans 1, Michael Krauthammer 1
PMCID: PMC3243242  PMID: 22195090

Abstract

Science blogs, Twitter commentary, and comments on journal websites represent an immediate response to journal articles, and may help in identifying relevant publications. However, the use of these media for establishing paper impact is not well studied. Using Wikipedia as a proxy for other social media, we explore the correlation between inclusion of a journal article in Wikipedia, and article impact as measured by citation count. We start by cataloging features of PubMed articles cited in Wikipedia. We find that Wikipedia pages referencing the most journal articles are about disorders and diseases, while the most referenced articles in Wikipedia are about genomics. We note that journal articles in Wikipedia have significantly higher citation counts than an equivalent random article subset. We also observe that articles are included in Wikipedia soon after publication. Our data suggest that social media may represent a largely untapped post-publication review resource for assessing paper impact.

Introduction

Keeping up with important scientific advances is difficult given the ever-increasing number of research articles. We are interested in whether the near real-time nature of social media can be exploited to assess the importance of a journal article soon after publication. Post-publication peer review is becoming faster as more tools facilitate discussion of journal articles, such as Twitter and blogs, which often feature recently published work. While it has been observed that blog and Twitter comments play a role in shaping a paper’s reception [1], we are not aware of prior work studying the relationship between article representation in social media and subsequent article impact. One can hypothesize that papers that are well received in the social media sphere are of broad interest and will spawn derivative work and publications. To study this hypothesis, we investigate article characteristics, such as citation count, of papers that are and are not discussed in social media. Given the dynamic nature of many social media, we decided to work with Wikipedia, as it provides a readily accessible archive of current and past user activity since its inception. We found that articles mentioned in Wikipedia have a greater propensity to be cited and to be listed on dedicated post-publication peer review sites such as Faculty of 1000 [2]. Furthermore, we observed that Wikipedia contributors add journal articles to Wikipedia soon after publication, with an ever-shortening time lag in recent years. These findings indicate the potential utility of monitoring Wikipedia updates to stay abreast of scientific advances as they appear in the literature.

Methods

We downloaded the 2010-01-30 dump of Wikipedia [3], containing roughly nineteen million pages with complete edit history, and used xml2sql [4] to convert the dump to SQL. In addition to the roughly seven million Wikipedia article pages, the dump contains Portal, Template, Talk, and User entries. Our main objective was to locate PubMed journal article citations in Wikipedia pages.

We searched for PubMed journal article references in the full history of each Wikipedia page by looking for citations with PubMed IDs (PMIDs). When a PMID was absent, we used the DOI and attempted to find a corresponding PMID using the ESearch tool from the Entrez Programming Utilities (eUtils) [5] provided by NCBI. We found 161155 PubMed journal articles in 38811 Wikipedia pages. For each journal article PMID, we used the EFetch tool from eUtils to obtain the journal article title, abstract, publication date, MeSH terms, source journal, and source journal ISSN. We determined the open access status of each journal by the presence of its ISSN in the Directory of Open Access Journals [6].
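
The identifier search described above can be illustrated with a short sketch. This is not the authors' code: the regular expressions are assumptions about how PMIDs and DOIs typically appear in wikitext (citation-template fields such as `|pmid=...` and free-text mentions such as `PMID 12345`), and the function names `extract_ids` and `ids_in_history` are hypothetical.

```python
import re

# Assumed patterns, not taken from the paper:
#   "|pmid=22195090", "| pmid = 22195090", "PMID 22195090", "PMID: 22195090"
PMID_RE = re.compile(r"\bpmid\s*[=:]?\s*(\d{1,8})\b", re.IGNORECASE)
#   "|doi=10.1038/469286a", "doi: 10.1038/469286a"
DOI_RE = re.compile(r"\bdoi\s*[=:]\s*(10\.\d{4,9}/[^\s|}\]]+)", re.IGNORECASE)


def extract_ids(wikitext):
    """Return (set of PMID strings, set of DOI strings) found in one revision."""
    return set(PMID_RE.findall(wikitext)), set(DOI_RE.findall(wikitext))


def ids_in_history(revisions):
    """Union the identifiers over every revision text of a page's history."""
    pmids, dois = set(), set()
    for text in revisions:
        found_pmids, found_dois = extract_ids(text)
        pmids |= found_pmids
        dois |= found_dois
    return pmids, dois
```

Scanning every revision rather than only the latest one means that a citation counts even if a later edit removed it, which matches the full-history search described in the text. DOIs collected this way would still need to be resolved to PMIDs in a separate step.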

We tested the hypothesis that PubMed journal articles in Wikipedia have higher impact than an equivalent random subset, with impact determined by citation counts and by scores from Faculty of 1000 (F1000), an online review site for journal articles. Since the Wikipedia dump is from January 2010, we limited this analysis to articles published before 2010. We obtained citation information for 411270 journal articles using PubMed Central (PMC). Of the journal articles cited in Wikipedia, 18910 were present in PMC. We downloaded all 8099 pages from F1000 on March 9, 2011, and linked the F1000 journal article scores to PMIDs through the PubMed Entrez link that F1000 provides for each journal article. We obtained F1000 scores for 65848 PubMed journal articles published before 2010; 4905 of these were also cited in Wikipedia pages. To compare citation counts or F1000 scores for journal articles in Wikipedia to those for articles not listed in Wikipedia, we calculated the probability of obtaining similar counts or scores by drawing random journal articles from F1000 or PMC. To rule out time since publication as a source of bias, the random journal article sets were drawn from the same publication years as the journal articles in Wikipedia.
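
The year-matched random comparison can be sketched as a Monte Carlo resampling test. The following is a minimal illustration under stated assumptions, not the authors' implementation: the text does not specify the test statistic or the number of draws, so the mean score, the default of 999 draws, and the names `year_matched_p_value`, `wiki_articles`, and `pool_by_year` are all hypothetical choices.

```python
import random


def year_matched_p_value(wiki_articles, pool_by_year, n_draws=999, seed=0):
    """Estimate how often year-matched random articles score as high.

    wiki_articles: list of (publication_year, score) for articles cited
                   in Wikipedia.
    pool_by_year:  dict mapping publication_year -> list of scores for all
                   candidate articles (for example, from PMC or F1000).
    """
    rng = random.Random(seed)
    # Assumed test statistic: the mean score of the Wikipedia-cited set.
    observed = sum(score for _, score in wiki_articles) / len(wiki_articles)
    at_least_as_high = 0
    for _ in range(n_draws):
        # For each Wikipedia-cited article, draw one random article from the
        # same publication year, so time since publication cannot bias the result.
        drawn = [rng.choice(pool_by_year[year]) for year, _ in wiki_articles]
        if sum(drawn) / len(drawn) >= observed:
            at_least_as_high += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_high + 1) / (n_draws + 1)
```

With 999 draws the smallest value this estimate can take is 0.001, so a result at that floor only says that no random draw matched the observed mean.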

Results

We began our exploration of social media as a tool for measuring journal article impact with a survey of Wikipedia pages that cite PubMed journal articles. Only 0.54% of Wikipedia pages cite a PubMed journal article. The Wikipedia page with the most journal article citations is Autophagy network, but this page was deleted from Wikipedia in January 2011 because it was considered original research. After removing this Wikipedia page, Alzheimer’s disease, with 372 journal article references, has the most citations. With the exceptions of Benzodiazepine (Valium), Medicinal mushrooms, and Reelin, the 10 Wikipedia pages with the most journal article citations are about diseases or disorders (Table 1). Even the protein reelin has an association with disorders, as it has been hypothesized to play a role in schizophrenia [7], Alzheimer’s disease [8], and autism [9].

Table 1: Wikipedia pages that cite the most PubMed journal articles.

Wikipedia page | Journal article citations
Alzheimer’s disease | 372
Lyme disease | 234
Chronic fatigue syndrome | 224
Benzodiazepine | 199
Medicinal mushrooms | 197
Crohn’s disease | 197
Major depressive disorder | 189
Eating disorders | 183
Reelin | 178
Multiple sclerosis | 174

Limiting our analysis to Wikipedia pages citing at least one journal article, the distribution of the number of journal articles cited by a Wikipedia page shows that most Wikipedia pages cite few journal articles (Figure 1, left). Among the small fraction of Wikipedia pages that cite a journal article, each page cites an average of 6.7 journal articles, and 36% of these pages cite only one journal article. Roughly 50 Wikipedia pages cite at least 100 journal articles, while almost 75% cite fewer than 10.

Figure 1:

Left: Distribution of the number of PubMed journal articles cited by a Wikipedia page. This plot is limited to Wikipedia pages that cite at least one journal article. Right: Distribution of the number of Wikipedia pages that cite a particular PubMed journal article.
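
The summary statistics behind these two distributions can be computed from a mapping of pages to the articles they cite. The sketch below is illustrative only: the input structure `page_to_pmids` (page title mapped to a set of PMIDs) and the function name `citation_distributions` are assumptions, not part of the original analysis.

```python
from collections import Counter


def citation_distributions(page_to_pmids):
    """Summarize articles-per-page and pages-per-article counts.

    page_to_pmids: dict mapping a page title to the set of PMIDs it cites.
    Pages with no citations are ignored, as in the left panel of the figure.
    """
    per_page = {page: len(pmids) for page, pmids in page_to_pmids.items() if pmids}
    # Because each page contributes a set, this counts pages per article.
    per_article = Counter(pmid for pmids in page_to_pmids.values() for pmid in pmids)
    n_pages = len(per_page)
    n_articles = len(per_article)
    return {
        "mean_articles_per_page": sum(per_page.values()) / n_pages,
        "share_pages_citing_one": sum(1 for n in per_page.values() if n == 1) / n_pages,
        "share_articles_on_multiple_pages": sum(1 for n in per_article.values() if n > 1) / n_articles,
    }
```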

Properties of journal articles cited in Wikipedia

Moving from our Wikipedia page survey to an investigation of the journal articles cited in Wikipedia, we sought to determine what types of journal articles are represented in Wikipedia pages. We found 161155 PubMed journal articles cited in the histories of all Wikipedia pages. Estimating 20 million journal articles in PubMed [10], only about 0.8% of PubMed journal articles are cited in Wikipedia pages.

While the 10 Wikipedia pages citing the most journal articles are about disorders, the 10 most cited journal articles in Wikipedia are mostly about genomics (Table 2). These articles refer to high-throughput screens that implicate many genomic elements, such as genes, each with its own Wikipedia page. 19.5% of journal articles are cited by more than one Wikipedia page (Figure 1, right). 52% of journal articles cited in Wikipedia were published between 2001 and 2010 (Figure 2, left). More than one fourth of the most frequent article-linked MeSH terms are related to genetics (Table 3). Only 2.8% of journal articles cited in Wikipedia come from open access journals.

Table 2: Most cited PubMed journal articles in Wikipedia pages.

Journal article | Citing Wikipedia pages | Publication year
Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences | 7159 | 2002
The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC) | 4963 | 2004
Complete sequencing and characterization of 21,243 full-length human cDNAs | 3230 | 2004
Towards a proteome-scale map of the human protein-protein interaction network | 1343 | 2005
Oligo-capping: a simple method to replace the cap structure of eukaryotic mRNAs with oligoribonucleotides | 1316 | 1994
Construction and characterization of a full length-enriched and a 5′-end-enriched cDNA library | 1282 | 1997
Global, in vivo, and site-specific phosphorylation dynamics in signaling networks | 734 | 2006
Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes | 663 | 2006
Normalization and subtraction: two approaches to facilitate gene discovery | 619 | 1996
DNA cloning using in vitro site-specific recombination | 615 | 2000

Figure 2:

Left: Number of journal article citations in Wikipedia by publication year. Right: Number of journal article citations in Wikipedia by journal names. Blue dots indicate open access journals. Journals are restricted to those with more than 100 articles in Wikipedia.

Table 3: Top MeSH terms of journal articles cited in Wikipedia.

MeSH term | Wikipedia pages
Humans | 29048
Animals | 22691
metabolism | 22002
genetics | 20080
Male | 17894
physiology | 17844
Female | 17483
chemistry | 17235
Molecular Sequence Data | 15339
analysis | 14357
Mice | 13895
pharmacology | 13784
Amino Acid Sequence | 13229
methods | 13017
Base Sequence | 12786
Rats | 12516
Adult | 12380
drug effects | 11644
Cloning, Molecular | 11108
isolation & purification | 10509
RNA, Messenger | 10377
DNA, Complementary | 10244
biosynthesis | 10049

There are roughly 6500 journals represented in Wikipedia. 1754 of these are cited more than 10 times, and 241 are cited more than 100 times (Figure 2, right). The top 10 most cited journals in Wikipedia are shown in Table 4. None of these journals are open access. There are only 444 open access journals cited in Wikipedia pages. 108 are cited more than 10 times, and 6 are cited more than 100 times. The open access journal with the most citations in Wikipedia is Nucleic Acids Research.

Table 4: Most cited journals in Wikipedia pages.

Journal | Wikipedia citations
J Biol Chem | 12700
Proc Natl Acad Sci U S A | 5341
Science | 3545
Nature | 3001
Biochem Biophys Res Commun | 2947
Genomics | 2604
Mol Cell Biol | 1984
FEBS Lett | 1611
Biochim Biophys Acta | 1588
Oncogene | 1491

Impact of journal articles cited in Wikipedia

Our hypothesis is that journal articles with far-reaching implications or general importance are included in Wikipedia, and that other articles are ignored. In other words, users of Wikipedia add an article citation if the article either establishes a new research topic, and thus triggers the generation of a novel Wikipedia page, or reports novel and non-marginal findings that are added to an existing Wikipedia page. If the hypothesis holds, then journal articles cited in Wikipedia pages will have higher impact scores than an equivalent random subset. To evaluate whether article inclusion in Wikipedia correlates with journal article impact, we compared it to two forms of impact scores, journal article citation counts and Faculty of 1000 (F1000) scores. While citation counts are an established proxy of the journal article impact, we sought to expand the analysis by F1000 scores, which represent experts’ opinions about the importance of a particular article.

We used PubMed Central (PMC) to find the number of citing journal articles for roughly nineteen thousand journal articles cited in Wikipedia pages. We used F1000 to find impact scores for about five thousand journal articles cited in Wikipedia pages. Using both PMC citation counts and F1000 scores as journal article impact measures, we found that journal articles cited in Wikipedia pages had significantly higher citation counts and F1000 scores than an equivalent random subset (p-value < 0.001 for both impact measures). In Figure 3, we also show the likelihood for article inclusion in Wikipedia given its citation count and F1000 score. As those two measures increase, up to 25% of articles from PMC and F1000 are included in Wikipedia.

Figure 3:

Correlation between PubMed journal articles cited in Wikipedia and journal article impact, measured by a journal article’s citation count, and its Faculty of 1000 (F1000) score. The plot shows the percent of journal articles with a particular impact score that are cited in Wikipedia. The higher a journal article’s impact score, the more likely that the journal article will be cited in Wikipedia. F1000 scores were rounded to 5, and citation counts were rounded to 10. Only impact scores supported by 30 journal articles are shown.
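
The binning described in this caption can be sketched as follows. This is an illustration under assumptions rather than the authors' code: rounding is taken to mean rounding to the nearest multiple of the bin width (halves rounded up), the support threshold is read as "at least 30 articles", and the name `inclusion_rate_by_score` is hypothetical.

```python
import math
from collections import defaultdict


def inclusion_rate_by_score(articles, bin_width, min_support=30):
    """Percent of articles cited in Wikipedia for each rounded impact score.

    articles:    iterable of (score, in_wikipedia) pairs, scores non-negative.
    bin_width:   10 for citation counts, 5 for F1000 scores (per the caption).
    min_support: bins with fewer articles than this are dropped (assumed
                 reading of "supported by 30 journal articles").
    """
    totals = defaultdict(int)
    cited = defaultdict(int)
    for score, in_wikipedia in articles:
        # Round to the nearest multiple of bin_width, halves rounded up.
        bin_value = math.floor(score / bin_width + 0.5) * bin_width
        totals[bin_value] += 1
        if in_wikipedia:
            cited[bin_value] += 1
    return {
        b: 100.0 * cited[b] / n
        for b, n in sorted(totals.items())
        if n >= min_support
    }
```

Dropping sparsely populated bins keeps a handful of very highly cited articles from producing unstable percentages at the right edge of the plot.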

Having established that Wikipedia can serve as a filter for high impact journal articles, we tackled the question of the timeliness of article inclusion in Wikipedia. To address this issue, we asked how soon after publication journal articles were listed in Wikipedia. We determined that most journal articles published past 2009 were included in Wikipedia within the first few months after publication (Figure 4, left). Furthermore, we found that Wikipedia contributors were becoming faster at including journal articles as time progressed (Figure 4, right).

Figure 4:

Left: Months until journal article inclusion in Wikipedia for journal articles published from 2007 to 2010. Right: Months until journal article inclusion for articles published from 2004 to 2010. The inclusion time is limited to twelve months for a fair comparison across years. There is a visible trend for journal articles published in later years to be included in Wikipedia faster than journal articles published in earlier years.
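
The inclusion lag can be computed from each article's publication date and the date of the first Wikipedia revision that cites it. The sketch below is a hedged illustration: the use of whole calendar months, the median as the summary per publication year, and the names `lag_in_months` and `median_lag_by_year` are assumptions not stated in the text.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def lag_in_months(published, first_cited):
    """Whole calendar months between publication and first Wikipedia citation."""
    return (first_cited.year - published.year) * 12 + (first_cited.month - published.month)


def median_lag_by_year(records, max_months=12):
    """Median inclusion lag per publication year.

    records: iterable of (published, first_cited) date pairs.
    Lags above max_months are excluded so that older articles, which have
    had more time to be added, do not distort the comparison across years.
    """
    lags = defaultdict(list)
    for published, first_cited in records:
        lag = lag_in_months(published, first_cited)
        if 0 <= lag <= max_months:
            lags[published.year].append(lag)
    return {year: median(values) for year, values in sorted(lags.items())}
```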

Discussion

We investigated the use of Wikipedia for measuring the impact of PubMed journal articles. Using journal article citations and Faculty of 1000 scores as separate measures of journal article impact, we determined that journal articles in Wikipedia have higher impact scores than an equivalent random article set. Thus, Wikipedia acts as a filter for high impact journal articles. We also used Wikipedia histories to find the exact date that each journal article was entered into Wikipedia, allowing us to calculate the lag between journal article publication and inclusion in Wikipedia. We found that most new journal articles are added to Wikipedia soon after they are published. Together, these data support our hypothesis that users of Wikipedia add article citations for publications with high importance and relevance, with little lag time between publication and inclusion in Wikipedia. It should be noted, however, that each social medium has its own user community, which will ultimately bias the type of articles discussed. From our survey of journal article citations in Wikipedia, we see that diseases and disorders are heavily researched in Wikipedia, and the corresponding pages are likely better curated. We found that the most cited journal articles in Wikipedia were related to genomics, which may further limit the generalizability of our findings. We would also like to note that we did not study feedback loops that may exist between article impact measures and inclusion of articles in Wikipedia. It may well be that articles mentioned in Wikipedia trigger interest that subsequently influences the number of citations. Finally, we did not consider further article impact measures, such as journal article downloads and other usage metrics [11], which may be equally informative for measuring article relevance. Despite these shortcomings, we believe that our data support the notion that PubMed articles listed in Wikipedia are of interest and relevance.

Conclusion

We conclude that Wikipedia selectively lists high impact journal articles soon after they are published, and thus represents a useful resource for identifying relevant articles. We hope that our study will spur further exploration of using social media for measuring journal paper reception and impact.

References

  • 1. Mandavilli A. Peer review: Trial by Twitter. Nature. 2011;469(7330):286. doi: 10.1038/469286a.
  • 2. Wets K, Weedon D, Velterop J. Post-publication filtering and evaluation: Faculty of 1000. Learned Publishing. 2003;16(4):249–258.
  • 3. Wikipedia dump. Available at http://dumps.wikimedia.org/enwiki/20100130/pages-meta-history.xml.7z.
  • 4. xml2sql. Available at http://meta.wikimedia.org/wiki/xml2sql.
  • 5. Sayers E, Wheeler D. Building Customized Data Pipelines Using the Entrez Programming Utilities (eUtils). National Center for Biotechnology Information (US); 2004.
  • 6. Singh J. Directory of open access journals. Indian Journal of Pharmacology. 2005;37(3):198.
  • 7. Fatemi SH, Earle JA, McMenomy T. Reduction in Reelin immunoreactivity in hippocampus of subjects with schizophrenia, bipolar disorder and major depression. Molecular Psychiatry. 2000;5(6):654. doi: 10.1038/sj.mp.4000783.
  • 8. Botella-López A, Burgaya F, Gavín R, García-Ayllón M, Gómez-Tortosa E, Peña-Casanova J, Ureña JM, Del Río JA, Blesa R, Soriano E, et al. Reelin expression and glycosylation patterns are altered in Alzheimer’s disease. Proceedings of the National Academy of Sciences. 2006.
  • 9. Fatemi SH. The role of Reelin in pathology of autism. Molecular Psychiatry. 2002;7(9):919. doi: 10.1038/sj.mp.4001248.
  • 10. PubMed. Total citation count at http://www.ncbi.nlm.nih.gov/pubmed/.
  • 11. Bollen J, Rodriguez MA, Van de Sompel H. Mesur: usage-based metrics of scholarly impact. Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries; ACM; 2007. p. 474.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
