2021 Apr 23;54(6):4731–4770. doi: 10.1007/s10462-021-10010-6

Table 2.

Existing datasets for keyword extraction from 2003 to 2015

Name	#Doc	General description
CSTR	1800	Computer science technical reports; the dataset was introduced in 1999. Witten et al. (2005)
Inspec	2000	Abstracts published from 1998 to 2002 in the areas of Computers and Control, and Information Technology. Hulth (2003)
eBooks	101	eBooks randomly chosen from a wide range of fields. Huang et al. (2006)
Nguyen2007	211	Corpus of scientific publications annotated for keyphrases. Nguyen and Kan (2007)
Wiki20	20	Links all important phrases in a document to Wikipedia articles.
Schutz2008	1231	Research papers selected from PubMed Central, distributed across 254 different journals. Schutz (2008)
Fao30	30	30 documents from the Food and Agriculture Organization (FAO) of the UN, annotated by six professional annotators at FAO. Medelyan et al. (2009)
Fao780	780	780 documents from the Food and Agriculture Organization (FAO) of the UN, annotated by six professional annotators at FAO. Medelyan and Witten (2008)
Krapivin2009	2304	Full-text computer science articles published by ACM from 2003 to 2005. Krapivin et al. (2009)
Citeulike180	180	The dataset is based on a subset of CiteULike.org containing documents that are indexed with at least three keywords on which at least two users have agreed. Medelyan et al. (2009)
SemEval2010	244	Full scientific articles from ACM, created for SemEval2010 Task 5
SemEval2010	284	ACM Digital Library papers in four ACM 1998 classification categories: distributed systems; information search and retrieval; distributed artificial intelligence (multiagent systems); and social and behavioral sciences (economics). Kim et al. (2010)
MPQA	535	News reports from 187 different US and foreign news sources, published from June 2001 to May 2002.
Marujo2012	450	Portuguese news stories adapted into English for English-language keyword extraction. Marujo et al. (2013)
500N-KPCrowd-v1.1	500	Ten categories (art and culture; business; crime; fashion; health; politics us; politics world; science; sports; technology) with 50 documents per category.
110-PT-BN-KP	110	News from the European Portuguese ALERT Broadcast News database. Marujo et al. (2013)
Wikinews	100	French corpus created from the French version of WikiNews that contains 100 news articles published between May 2012 and December 2012. Bougouin et al. (2013)
KDD	704	Abstracts of KDD conference papers. Caragea et al. (2014)
WWW	1248	Abstracts of WWW conference papers. Caragea et al. (2014)
Blogs	14000	The authors merged all the posts in each blog into one large document, because TF-IDF is usually applied to a set of single documents. No special features or linking were used in this dataset. Park et al. (2014)
CACIC	888	Spanish articles published between 2005 and 2013 in the Argentine Congress of Computer Science. Aquino and Lanzarini (2015)
PubMed	500	Full-text papers collected from PubMed Central, which comprises over 26 million citations for biomedical literature. Song et al. (2015)