
TABLE 1.

Five real-world massive text corpora in different domains and multiple languages. |Ω| is the total number of words, and size_p is the size of the positive pool. To demonstrate the domain independence of our model, we compare results on the three English datasets, DBLP, Yelp, and EN, which come from different domains. To demonstrate that our model works smoothly in different languages, we compare results on the three Wikipedia article datasets, EN, ES, and CN, which are in different languages.

Dataset  Domain             Language  |Ω|     File size  size_p
DBLP     Scientific Paper   English    91.6M  618MB       29K
Yelp     Business Review    English   145.1M  749MB       22K
EN       Wikipedia Article  English   808.0M  3.94GB     184K
ES       Wikipedia Article  Spanish   791.2M  4.06GB      65K
CN       Wikipedia Article  Chinese   371.9M  1.56GB      29K
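
For concreteness, statistics like those in Table 1 can be gathered with a single pass over each corpus. The sketch below is illustrative only and is not the paper's code: it assumes a plain-text corpus with one document per line and a positive-pool file with one phrase per line, and the file names are hypothetical. Note that whitespace tokenization applies to the English and Spanish corpora but not to Chinese, which would require word segmentation first.

```python
def total_words(corpus_path):
    """Count |Omega|, the total number of whitespace-delimited words.

    Works for space-separated languages (English, Spanish); a Chinese
    corpus would need to be word-segmented before this count is meaningful.
    """
    total = 0
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            total += len(line.split())
    return total


def positive_pool_size(pool_path):
    """Count size_p, the number of distinct phrases in the positive pool."""
    with open(pool_path, encoding="utf-8") as f:
        return len({line.strip() for line in f if line.strip()})


if __name__ == "__main__":
    # Hypothetical file names for the DBLP corpus and its positive pool.
    print("|Omega| =", total_words("dblp_corpus.txt"))       # ~91.6M in Table 1
    print("size_p  =", positive_pool_size("dblp_pool.txt"))  # ~29K in Table 1
```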