Abstract
Science blogs, Twitter commentary, and comments on journal websites represent an immediate response to journal articles, and may help in identifying relevant publications. However, the use of these media for establishing paper impact is not well studied. Using Wikipedia as a proxy for other social media, we explore the correlation between inclusion of a journal article in Wikipedia, and article impact as measured by citation count. We start by cataloging features of PubMed articles cited in Wikipedia. We find that Wikipedia pages referencing the most journal articles are about disorders and diseases, while the most referenced articles in Wikipedia are about genomics. We note that journal articles in Wikipedia have significantly higher citation counts than an equivalent random article subset. We also observe that articles are included in Wikipedia soon after publication. Our data suggest that social media may represent a largely untapped post-publication review resource for assessing paper impact.
Introduction
Keeping up with important scientific advances is difficult given the ever-increasing number of research articles. We are interested in whether the near real-time nature of social media can be exploited to assess the importance of a journal article soon after publication. Post-publication peer review is becoming faster as more tools facilitate discussion of journal articles, such as Twitter and blogs, which often feature discussions of recently published articles. While it has been observed that blog and Twitter comments play a role in shaping a paper’s reception [1], we are not aware of prior work studying the relationship between article representation in social media and subsequent article impact. One can hypothesize that papers that are well received in the social media sphere are of broad interest and will spawn derivative work and publications. To study this hypothesis, we investigate article characteristics, such as citation count, of papers that are and are not discussed in social media. Given the dynamic nature of many social media, we decided to work with Wikipedia, as it provides a readily accessible archive of current and past user activity since its inception. We found that articles mentioned in Wikipedia have a greater propensity to be cited and to be listed on dedicated post-publication peer review sites such as Faculty of 1000 [2]. Furthermore, we observed that Wikipedia contributors add journal articles to Wikipedia soon after publication, with an ever-shortening time lag in recent years. These findings indicate the potential utility of monitoring Wikipedia updates to stay abreast of scientific advances as they appear in the literature.
Methods
We downloaded the 2010-01-30 dump of Wikipedia [3], containing roughly nineteen million pages with complete edit history, and used xml2sql [4] to convert the dump to SQL. Of these, roughly seven million are Wikipedia pages proper; the remainder are Portal, Template, Talk, and User entries. Our main objective was to locate PubMed journal article citations in Wikipedia pages.
We searched for PubMed journal article references in the full history of each Wikipedia page by looking for citations with PubMed IDs (PMIDs). When a PMID was absent, we used the DOI and attempted to find a corresponding PMID using the ESearch tool from the Entrez Programming Utilities (eUtils) [5] provided by NCBI. We found 161155 PubMed journal articles cited in 38811 Wikipedia pages. For each journal article PMID, we used the EFetch tool from eUtils to obtain the journal article title, abstract, publication date, MeSH terms, source journal, and source journal ISSN. We determined the open access status of each journal by the presence of its ISSN in the Directory of Open Access Journals [6].
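The identifier search described above can be sketched as follows. This is a minimal illustration rather than the code used in this study: the regular expressions, the citation-template field names (`pmid`, `doi`), and the function name `extract_citation_ids` are assumptions about how citations typically appear in page wikitext.

```python
import re

# Hypothetical patterns; the exact matching rules used in the study are not specified.
PMID_TEMPLATE = re.compile(r"\|\s*pmid\s*=\s*(\d+)", re.IGNORECASE)   # {{cite journal | pmid = 123}}
PMID_INLINE = re.compile(r"\bPMID[:\s]\s*(\d+)")                      # free-text "PMID 123"
DOI_TEMPLATE = re.compile(r"\|\s*doi\s*=\s*(10\.\d{4,9}/[^\s|}]+)", re.IGNORECASE)


def extract_citation_ids(wikitext):
    """Return (pmids, dois) found in one revision of a page's wikitext."""
    pmids = set(PMID_TEMPLATE.findall(wikitext)) | set(PMID_INLINE.findall(wikitext))
    dois = set(DOI_TEMPLATE.findall(wikitext))
    return pmids, dois
```

Running such a function over every revision of a page, rather than only the latest one, would also capture citations that were later removed. DOIs collected this way would still need to be resolved to PMIDs in a separate lookup step.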
We tested the hypothesis that PubMed journal articles cited in Wikipedia have higher impact than an equivalent random subset, with impact measured by citation counts and by scores from Faculty of 1000 (F1000), an online review site for journal articles. Since the Wikipedia dump is from January 2010, we limited this analysis to articles published before 2010. We obtained citation information for 411270 journal articles using PubMed Central (PMC). Of the 38811 PMIDs in Wikipedia, 18910 were present in PMC. We downloaded all 8099 pages from F1000 on March 9, 2011, and linked the F1000 journal article scores to journal article PMIDs through each journal article’s PubMed Entrez link provided by F1000. We obtained F1000 scores for 65848 PubMed journal articles published before 2010. Of these, 4905 were also cited in Wikipedia pages. To compare citation counts or F1000 scores for journal articles in Wikipedia to those for articles not listed in Wikipedia, we calculated the probability of obtaining similar counts or scores by drawing random journal articles from F1000 or PMC. To rule out time since publication as a source of bias, the random journal article sets were drawn from the same publication years as the journal articles in Wikipedia.
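A year-matched random-sampling comparison of this kind can be sketched as below. The sketch is illustrative only and rests on assumptions not stated in the text: it uses the mean score as the test statistic, draws one random article per Wikipedia-cited article from the same publication year, and reports an empirical p-value with an add-one correction. The function name `empirical_p_value` and its parameters are hypothetical.

```python
import random
from statistics import mean


def empirical_p_value(wiki_articles, pool_by_year, n_draws=1000, seed=0):
    """Estimate how often a year-matched random set scores as high as the Wikipedia set.

    wiki_articles: list of (publication_year, score) for Wikipedia-cited articles.
    pool_by_year:  dict mapping publication_year -> list of scores for all
                   articles from that year (e.g. PMC citation counts).
    """
    rng = random.Random(seed)
    observed = mean(score for _, score in wiki_articles)
    at_least_as_high = 0
    for _ in range(n_draws):
        # One random article per Wikipedia-cited article, from the same year.
        sample = [rng.choice(pool_by_year[year]) for year, _ in wiki_articles]
        if mean(sample) >= observed:
            at_least_as_high += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_high + 1) / (n_draws + 1)
```

With 1000 draws, the smallest value this estimate can take is 1/1001, so a result at that floor corresponds to a p-value below 0.001.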
Results
We began our exploration of social media as a tool for measuring journal article impact with a survey of Wikipedia pages that cite PubMed journal articles. Only 0.54% of Wikipedia pages cite a PubMed journal article. The Wikipedia page with the most journal article citations is Autophagy network, but this page was deleted from Wikipedia in January 2011 because it was considered original research. After removing this Wikipedia page, Alzheimer’s disease, with 372 journal article references, has the most citations. With the exceptions of Benzodiazepine (Valium), Medicinal mushrooms, and Reelin, the 10 Wikipedia pages with the most journal article citations are about diseases or disorders (Table 1). Even the protein reelin has an association with disorders, as it has been hypothesized to play a role in schizophrenia [7], Alzheimer’s disease [8], and autism [9].
Table 1:
Wikipedia Pages | Journal Article Citations |
---|---|
Alzheimer’s disease | 372 |
Lyme disease | 234 |
Chronic fatigue syndrome | 224 |
Benzodiazepine | 199 |
Medicinal mushrooms | 197 |
Crohn’s disease | 197 |
Major depressive disorder | 189 |
Eating disorders | 183 |
Reelin | 178 |
Multiple sclerosis | 174 |
Limiting our analysis to Wikipedia pages citing at least one journal article, the distribution of the number of journal articles cited per page shows that most Wikipedia pages cite few journal articles (Figure 1, left). Among the small fraction of Wikipedia pages that cite a journal article, each page cites an average of 6.7 journal articles, and 36% cite only one. Roughly 50 Wikipedia pages cite at least 100 journal articles, while almost 75% of Wikipedia pages cite fewer than 10.
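The per-page summary statistics reported above could be computed along the following lines. This is a small sketch under the assumption that citations are held as a mapping from page title to the set of cited PMIDs; the function name `citation_summary` and the dictionary keys are hypothetical.

```python
def citation_summary(page_to_pmids):
    """Summarize citations per page, ignoring pages that cite no journal article.

    page_to_pmids: dict mapping page title -> set of cited PMIDs.
    """
    counts = [len(pmids) for pmids in page_to_pmids.values() if pmids]
    n_pages = len(counts)
    return {
        "mean_per_page": sum(counts) / n_pages,
        "share_citing_one": sum(c == 1 for c in counts) / n_pages,
        "share_citing_fewer_than_10": sum(c < 10 for c in counts) / n_pages,
        "pages_citing_100_or_more": sum(c >= 100 for c in counts),
    }
```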
Properties of journal articles cited in Wikipedia
Moving from our Wikipedia page survey to an investigation of the journal articles cited in Wikipedia, we sought to determine what types of journal articles are represented in Wikipedia pages. We found 161155 PubMed journal articles cited in the histories of all Wikipedia pages. Estimating 20 million journal articles in PubMed [10], only about 0.8% of PubMed journal articles are cited in Wikipedia pages.
While the 10 Wikipedia pages citing the most journal articles are about disorders, the 10 most cited journal articles in Wikipedia are mostly about genomics (Table 2). These articles refer to high-throughput screens that implicate many genomic elements, such as genes, each with its own Wikipedia page. 19.5% of journal articles are cited by more than one Wikipedia page (Figure 1, right). 52% of journal articles cited in Wikipedia were published between 2001 and 2010 (Figure 2, left). More than one fourth of the most frequent article-linked MeSH terms are related to genetics (Table 3). Only 2.8% of journal articles cited in Wikipedia come from open access journals.
Table 2:
Journal Article | Citing Wikipedia Pages | Publication Year |
---|---|---|
Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences | 7159 | 2002 |
The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC) | 4963 | 2004 |
Complete sequencing and characterization of 21,243 full-length human cDNAs | 3230 | 2004 |
Towards a proteome-scale map of the human protein-protein interaction network | 1343 | 2005 |
Oligo-capping: a simple method to replace the cap structure of eukaryotic mRNAs with oligoribonucleotides | 1316 | 1994 |
Construction and characterization of a full length-enriched and a 5′-end-enriched cDNA library | 1282 | 1997 |
Global, in vivo, and site-specific phosphorylation dynamics in signaling networks | 734 | 2006 |
Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes | 663 | 2006 |
Normalization and subtraction: two approaches to facilitate gene discovery | 619 | 1996 |
DNA cloning using in vitro site-specific recombination | 615 | 2000 |
Table 3:
MeSH Term | Wikipedia Pages |
---|---|
Humans | 29048 |
Animals | 22691 |
metabolism | 22002 |
genetics | 20080 |
Male | 17894 |
physiology | 17844 |
Female | 17483 |
chemistry | 17235 |
Molecular Sequence Data | 15339 |
analysis | 14357 |
Mice | 13895 |
pharmacology | 13784 |
Amino Acid Sequence | 13229 |
methods | 13017 |
Base Sequence | 12786 |
Rats | 12516 |
Adult | 12380 |
drug effects | 11644 |
Cloning, Molecular | 11108 |
isolation & purification | 10509 |
RNA, Messenger | 10377 |
DNA, Complementary | 10244 |
biosynthesis | 10049 |
There are roughly 6500 journals represented in Wikipedia. 1754 of these are cited more than 10 times, and 241 are cited more than 100 times (Figure 2, right). The top 10 most cited journals in Wikipedia are shown in Table 4. None of these journals are open access. There are only 444 open access journals cited in Wikipedia pages. 108 are cited more than 10 times, and 6 are cited more than 100 times. The open access journal with the most citations in Wikipedia is Nucleic Acids Research.
Table 4:
Journal | Wikipedia Citations |
---|---|
J Biol Chem | 12700 |
Proc Natl Acad Sci U S A | 5341 |
Science | 3545 |
Nature | 3001 |
Biochem Biophys Res Commun | 2947 |
Genomics | 2604 |
Mol Cell Biol | 1984 |
FEBS Lett | 1611 |
Biochim Biophys Acta | 1588 |
Oncogene | 1491 |
Impact of journal articles cited in Wikipedia
Our hypothesis is that journal articles with far-reaching implications or general importance are included in Wikipedia, and that other articles are ignored. In other words, users of Wikipedia add an article citation if the article either establishes a new research topic, and thus triggers the creation of a new Wikipedia page, or reports novel and non-marginal findings that are added to an existing Wikipedia page. If the hypothesis holds, then journal articles cited in Wikipedia pages will have higher impact scores than an equivalent random subset. To evaluate whether article inclusion in Wikipedia correlates with journal article impact, we compared it to two forms of impact score: journal article citation counts and Faculty of 1000 (F1000) scores. While citation counts are an established proxy for journal article impact, we sought to expand the analysis with F1000 scores, which represent experts’ opinions about the importance of a particular article.
We used PubMed Central (PMC) to find the number of citing journal articles for roughly nineteen thousand journal articles cited in Wikipedia pages. We used F1000 to find impact scores for about five thousand journal articles cited in Wikipedia pages. Using both PMC citation counts and F1000 scores as journal article impact measures, we found that journal articles cited in Wikipedia pages had significantly higher citation counts and F1000 scores than an equivalent random subset (p-value < 0.001 for both impact measures). In Figure 3, we also show the likelihood for article inclusion in Wikipedia given its citation count and F1000 score. As those two measures increase, up to 25% of articles from PMC and F1000 are included in Wikipedia.
Having established that Wikipedia can serve as a filter for high impact journal articles, we tackled the question of the timeliness of article inclusion in Wikipedia. To address this issue, we asked how soon after publication journal articles were listed in Wikipedia. We determined that most journal articles published past 2009 were included in Wikipedia within the first few months after publication (Figure 4, left). Furthermore, we found that Wikipedia contributors were becoming faster at including journal articles as time progressed (Figure 4, right).
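The lag calculation can be illustrated with a short sketch. It assumes that, for each article, the publication date and the dates of all page revisions citing it are already available; taking the earliest such revision as the inclusion date and summarizing by the median lag per publication year are choices made here for illustration, and the function name `median_lag_by_year` is hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def median_lag_by_year(records):
    """Median days from publication to first Wikipedia citation, per publication year.

    records: iterable of (publication_date, revision_dates), where revision_dates
             holds the dates of every page revision that cites the article.
    """
    lags = defaultdict(list)
    for published, revision_dates in records:
        first_cited = min(revision_dates)  # earliest appearance in any page history
        lags[published.year].append((first_cited - published).days)
    return {year: median(days) for year, days in sorted(lags.items())}
```

A decreasing median across successive publication years would correspond to the shortening lag described above.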
Discussion
We investigated the use of Wikipedia for measuring the impact of PubMed journal articles. Using journal article citations and Faculty of 1000 scores as separate measures of journal article impact, we determined that journal articles in Wikipedia have higher impact scores than an equivalent random article set. Thus, Wikipedia acts as a filter for high impact journal articles. We also used Wikipedia histories to find the exact date that each journal article was entered into Wikipedia, allowing us to calculate the lag between journal article publication and inclusion in Wikipedia. We found that most new journal articles are added to Wikipedia soon after they are published. Together, these data support our hypothesis that users of Wikipedia add article citations for publications of high importance and relevance, with little lag time between publication and inclusion in Wikipedia. It should be noted, however, that each social medium has its own user community, which will ultimately bias the type of articles discussed. From our survey of journal article citations in Wikipedia, we see that diseases and disorders are heavily researched in Wikipedia, and the corresponding pages are likely better curated. We found that the most cited journal articles in Wikipedia were related to genomics, which may further limit the generalizability of our findings. We would also like to note that we did not study feedback loops that may exist between article impact measures and inclusion of articles in Wikipedia. It may well be that articles mentioned in Wikipedia trigger interest that subsequently influences the number of citations. Finally, we did not consider further article impact measures, such as journal article downloads and other usage metrics [11], which may be equally informative for measuring article relevance. Despite these shortcomings, we believe that our data support the notion that PubMed articles listed in Wikipedia are of interest and relevance.
Conclusion
We conclude that Wikipedia selectively lists high impact journal articles soon after they are published, and thus represents a useful resource for identifying relevant articles. We hope that our study will spur further exploration of using social media for measuring journal paper reception and impact.
References
- 1. Mandavilli A. Peer review: Trial by Twitter. Nature. 2011;469(7330):286. doi: 10.1038/469286a.
- 2. Wets K, Weedon D, Velterop J. Post-publication filtering and evaluation: Faculty of 1000. Learned Publishing. 2003;16(4):249–258.
- 3. Wikipedia dump. Available at http://dumps.wikimedia.org/enwiki/20100130/pages-meta-history.xml.7z.
- 4. xml2sql. Available at http://meta.wikimedia.org/wiki/xml2sql.
- 5. Sayers E, Wheeler D. Building Customized Data Pipelines Using the Entrez Programming Utilities (eUtils). National Center for Biotechnology Information (US); 2004.
- 6. Singh J. Directory of open access journals. Indian Journal of Pharmacology. 2005;37(3):198.
- 7. Fatemi SH, Earle JA, McMenomy T. Reduction in Reelin immunoreactivity in hippocampus of subjects with schizophrenia, bipolar disorder and major depression. Molecular Psychiatry. 2000;5(6):654. doi: 10.1038/sj.mp.4000783.
- 8. Botella-López A, Burgaya F, Gavín R, García-Ayllón M, Gómez-Tortosa E, Peña-Casanova J, Ureña JM, Del Río JA, Blesa R, Soriano E, et al. Reelin expression and glycosylation patterns are altered in Alzheimer’s disease. Proc Natl Acad Sci U S A. 2006.
- 9. Fatemi SH. The role of Reelin in pathology of autism. Molecular Psychiatry. 2002;7(9):919. doi: 10.1038/sj.mp.4001248.
- 10. PubMed. Total citation count at http://www.ncbi.nlm.nih.gov/pubmed/.
- 11. Bollen J, Rodriguez MA, Van de Sompel H. Mesur: usage-based metrics of scholarly impact. Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries; ACM; 2007. pp. 474–474.