
TABLE 1.

Five real-world massive text corpora in different domains and multiple languages. |Ω| is the total number of words, and size_p is the size of the positive pool. To demonstrate the domain independence of our model, we compare results on the three English datasets, DBLP, Yelp, and EN, which come from different domains. To demonstrate that our model works smoothly in different languages, we compare results on the three Wikipedia article datasets, EN, ES, and CN, which are in different languages.

Dataset  Domain             Language  |Ω|     File size  size_p
DBLP     Scientific Paper   English    91.6M  618MB       29K
Yelp     Business Review    English   145.1M  749MB       22K
EN       Wikipedia Article  English   808.0M  3.94GB     184K
ES       Wikipedia Article  Spanish   791.2M  4.06GB      65K
CN       Wikipedia Article  Chinese   371.9M  1.56GB      29K
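
For concreteness, statistics like those in Table 1 can be gathered with a single pass over each corpus. The sketch below is illustrative only and is not the paper's code: it assumes a plain-text corpus with one document per line and a positive-pool file with one phrase per line, and the file names are hypothetical. Note that whitespace tokenization applies to the English and Spanish corpora but not to Chinese, which would require word segmentation first.

```python
def total_words(corpus_path):
    """Count |Omega|, the total number of whitespace-delimited words.

    Works for space-separated languages (English, Spanish); a Chinese
    corpus would need to be word-segmented before this count is meaningful.
    """
    total = 0
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            total += len(line.split())
    return total


def positive_pool_size(pool_path):
    """Count size_p, the number of distinct phrases in the positive pool."""
    with open(pool_path, encoding="utf-8") as f:
        return len({line.strip() for line in f if line.strip()})


if __name__ == "__main__":
    # Hypothetical file names for the DBLP corpus and its positive pool.
    print("|Omega| =", total_words("dblp_corpus.txt"))       # ~91.6M in Table 1
    print("size_p  =", positive_pool_size("dblp_pool.txt"))  # ~29K in Table 1
```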