2021 Apr 23;54(6):4731–4770. doi: 10.1007/s10462-021-10010-6

Table 2.

Existing datasets for keyword extraction from 2003 to 2015

| Name | #Doc | General description | Reference |
|---|---|---|---|
| CSTR | 1800 | Computer science technical reports; the dataset was introduced in 1999. | Witten et al. (2005) |
| Inspec | 2000 | Abstracts from 1998 to 2002 in Computers and Control, and Information Technology. | Hulth (2003) |
| eBooks | 101 | eBooks randomly chosen from a wide range of fields. | Huang et al. (2006) |
| Nguyen2007 | 211 | Corpus of scientific publications annotated for keyphrases. | Nguyen and Kan (2007) |
| Wiki20 | 20 | Links all important phrases in a document to Wikipedia articles. | |
| Schutz2008 | 1231 | Research papers selected from PubMed Central, distributed across 254 different journals. | Schutz (2008) |
| Fao30 | 30 | 30 documents from the FAO of the UN. Crowdsourced annotation by six professional annotators at FAO. | Medelyan et al. (2009) |
| Fao780 | 780 | 780 documents from the FAO of the UN. Crowdsourced annotation by six professional annotators at FAO. | Medelyan and Witten (2008) |
| Krapivin2009 | 2304 | Full-text computer science journal articles published by ACM from 2003 to 2005. | Krapivin et al. (2009) |
| Citeulike180 | 180 | A subset of CiteULike.org containing documents indexed with at least three keywords on which at least two users agreed. | Medelyan et al. (2009) |
| SemEval2010 | 244 | Full scientific articles from ACM, created for SemEval2010 Task 5. | |
| SemEval2010 | 284 | ACM Digital Library papers in four ACM 1998 classification categories: distributed systems; information search and retrieval; distributed artificial intelligence (multiagent systems); and social and behavioral sciences (economics). | Kim et al. (2010) |
| MPQA | 535 | News reports from 187 US and foreign news sources, published from June 2001 to May 2002. | |
| Marujo2012 | 450 | Portuguese news stories adapted into English for keyword extraction in English. | Marujo et al. (2013) |
| 500N-KPCrowd-v1.1 | 500 | Ten categories (art and culture; business; crime; fashion; health; politics us; politics world; science; sports; technology) with 50 documents per category. | |
| 110-PT-BN-KP | 110 | News from the European Portuguese ALERT Broadcast News database. | Marujo et al. (2013) |
| Wikinews | 100 | French corpus created from the French version of WikiNews, containing 100 news articles published between May 2012 and December 2012. | Bougouin et al. (2013) |
| KDD | 704 | Abstracts of KDD conference papers. | Caragea et al. (2014) |
| WWW | 1248 | Abstracts of WWW conference papers. | Caragea et al. (2014) |
| Blogs | 14000 | The authors merged all the posts in each blog into one large document, because TF-IDF is usually applied to a set of single documents. No special features or linking were used. | Park et al. (2014) |
| CACIC | 888 | Spanish articles published between 2005 and 2013 in the Argentine Congress of Computer Science. | Aquino and Lanzarini (2015) |
| PubMed | 500 | Full-text papers collected from PubMed Central, which comprises over 26 million citations for biomedical literature. | Song et al. (2015) |
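
The Blogs entry above describes merging all posts of a blog into one document before applying TF-IDF. The following is a minimal sketch of that preprocessing step, not the procedure of Park et al. (2014): the function names, the whitespace tokenizer, the `tf * log(N / df)` weighting, and the toy posts are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def merge_posts_by_blog(posts):
    """Concatenate every post of a blog into one document per blog.

    `posts` is an iterable of (blog_id, text) pairs (hypothetical format).
    """
    merged = defaultdict(list)
    for blog_id, text in posts:
        merged[blog_id].append(text)
    return {blog_id: " ".join(texts) for blog_id, texts in merged.items()}


def tfidf(documents):
    """Return {doc_id: {term: score}} with score = tf * log(N / df)."""
    # Naive lowercase whitespace tokenization, assumed for this sketch.
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)

    # Document frequency: number of documents in which each term occurs.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[doc_id] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


# Toy data: two blogs with two posts each (invented for illustration).
posts = [
    ("blog_a", "keyword extraction from news"),
    ("blog_a", "keyword ranking with graphs"),
    ("blog_b", "cooking pasta at home"),
    ("blog_b", "pasta sauce from tomatoes"),
]

blog_docs = merge_posts_by_blog(posts)
scores = tfidf(blog_docs)
top_a = max(scores["blog_a"], key=scores["blog_a"].get)
top_b = max(scores["blog_b"], key=scores["blog_b"].get)
```

With one document per blog, a term that occurs in every blog (here `from`) receives a score of zero, while terms concentrated in a single blog rank highest for that blog.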