Table 2.
Existing datasets for keyword extraction from 2003 to 2015
| Name | #Doc | General description |
|---|---|---|
| CSTR | 1800 | Computer science technical reports; the dataset was introduced in 1999. Witten et al. (2005) |
| Inspec | 2000 | Abstracts from 1998 to 2002 in the disciplines Computers and Control, and Information Technology. Hulth (2003) |
| eBooks | 101 | eBooks randomly chosen from a wide range of fields. Huang et al. (2006) |
| Nguyen2007 | 211 | Corpus of scientific publications annotated for keyphrases. Nguyen and Kan (2007) |
| Wiki20 | 20 | Links all important phrases in a document to Wikipedia articles. |
| Schutz2008 | 1231 | Research papers selected from PubMed Central, distributed across 254 different journals. Schutz (2008) |
| Fao30 | 30 | FAO of the UN, 30 documents. Crowdsourcing by six professional annotators at FAO. Medelyan et al. (2009) |
| Fao780 | 780 | FAO of the UN, 780 documents. Crowdsourcing by six professional annotators at FAO. Medelyan and Witten (2008) |
| Krapivin2009 | 2304 | Full-text scientific CS journal articles published by ACM from 2003 to 2005. Krapivin et al. (2009) |
| Citeulike180 | 180 | The dataset is based on a subset of CiteULike.org containing documents that are indexed with at least three keywords on which at least two users have agreed. Medelyan et al. (2009) |
| SemEval2010 | 244 | Full scientific articles from ACM, created for SemEval2010 Task 5 |
| SemEval2010 | 284 | ACM Digital Library papers in four ACM 1998 classification categories: distributed systems; information search and retrieval; distributed artificial intelligence (multiagent systems); and social and behavioral sciences (economics). Kim et al. (2010) |
| MPQA | 535 | News reports from 187 different US and foreign news sources, published from June 2001 to May 2002. |
| Marujo2012 | 450 | Portuguese news stories adapted into English for English-language keyword extraction. Marujo et al. (2013) |
| 500N-KPCrowd-v1.1 | 500 | Ten categories (art and culture; business; crime; fashion; health; politics us; politics world; science; sports; technology) with 50 documents per category. |
| 110-PT-BN-KP | 110 | News from the European Portuguese ALERT Broadcast News database. Marujo et al. (2013) |
| Wikinews | 100 | French corpus created from the French version of WikiNews that contains 100 news articles published between May 2012 and December 2012. Bougouin et al. (2013) |
| KDD | 704 | Abstracts of KDD conference papers. Caragea et al. (2014) |
| WWW | 1248 | Abstracts of WWW conference papers. Caragea et al. (2014) |
| Blogs | 14000 | The authors merged all posts in each blog into one large document, because TF-IDF is usually applied to a set of single documents. No special features or linking were used in this dataset. Park et al. (2014) |
| CACIC | 888 | Spanish articles published between 2005 and 2013 in the Argentine Congress of Computer Science. Aquino and Lanzarini (2015) |
| PubMed | 500 | Full-text papers collected from PubMed Central, which comprises over 26 million citations for biomedical literature. Song et al. (2015) |