
Table 2.

Results on recreating known lexicons.

(a) Corpus methods outperform WordNet on standard English. Using word-vector embeddings learned from a massive corpus (~10^11 tokens), both corpus-based methods outperform the WordNet-based approach overall.
Method      AUC    Ternary F1    τ
SentProp    90.6   58.6          0.44
Densifier   93.3   62.1          0.50
WordNet     89.5   58.7          0.34
Majority    –      24.8          –
(b) Corpus approaches are competitive with a distantly supervised method on Twitter. Using Twitter embeddings learned from ~10^9 tokens, the semi-supervised corpus approaches perform very well despite using only small seed sets.
Method          AUC    Ternary F1    τ
SentProp        86.0   60.1          0.50
Densifier       90.1   59.4          0.57
Sentiment140    86.2   57.7          0.51
Majority        –      24.9          –
(c) SentProp performs best with domain-specific finance embeddings. Using embeddings learned from a financial corpus (~2 × 10^7 tokens), SentProp significantly outperforms the other methods.
Method       AUC    Ternary F1
SentProp     91.6   63.1
Densifier    80.2   50.3
PMI          86.1   49.8
CountVecs    81.6   51.1
Majority     –      23.6
(d) SentProp performs well on standard English even with a 1000x reduction in corpus size. SentProp maintains strong performance when using embeddings learned from only the 2000s decade of COHA (~2 × 10^7 tokens).
Method       AUC    Ternary F1    τ
SentProp     83.8   53.0          0.28
Densifier    77.4   46.6          0.19
PMI          70.6   41.9          0.16
CountVecs    52.7   32.9          0.01
Majority     –      24.3          –
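
The metrics reported in Table 2 are standard evaluation measures for induced lexicons: AUC scores how well the continuous induced sentiment values rank positive words above negative ones, ternary F1 is the macro-averaged F1 over the positive/neutral/negative classes, and τ is Kendall's rank correlation between induced and gold continuous ratings. The sketch below (Python, using scikit-learn and SciPy) shows one way such metrics could be computed; the toy lexicon, gold labels, and class thresholds are hypothetical, and this is not the authors' evaluation code.

```python
# Minimal sketch (not the authors' evaluation code) of the three metrics in
# Table 2 for an induced sentiment lexicon. The toy lexicon, gold labels, and
# thresholds are hypothetical; the library calls are standard scikit-learn / SciPy.
from scipy.stats import kendalltau
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical induced lexicon: word -> continuous sentiment score.
induced = {"great": 0.9, "okay": 0.1, "terrible": -0.8, "awful": -0.7, "bank": 0.0}

# Hypothetical gold standard: word -> (ternary class, continuous rating).
gold = {
    "great":    ("positive",  0.8),
    "okay":     ("neutral",   0.1),
    "terrible": ("negative", -0.9),
    "awful":    ("negative", -0.6),
    "bank":     ("neutral",   0.0),
}

words = sorted(set(induced) & set(gold))

# AUC: ranking quality of the induced scores on positive vs. negative words
# (neutral gold entries are excluded from this metric).
pn = [w for w in words if gold[w][0] != "neutral"]
auc = roc_auc_score([int(gold[w][0] == "positive") for w in pn],
                    [induced[w] for w in pn])

# Ternary F1: macro-averaged F1 after thresholding the induced scores into
# negative / neutral / positive classes (thresholds are illustrative).
def to_ternary(score, lo=-0.25, hi=0.25):
    return "negative" if score < lo else ("positive" if score > hi else "neutral")

ternary_f1 = f1_score([gold[w][0] for w in words],
                      [to_ternary(induced[w]) for w in words],
                      average="macro",
                      labels=["negative", "neutral", "positive"])

# Kendall's tau: rank correlation between induced scores and gold continuous ratings.
tau, _ = kendalltau([induced[w] for w in words], [gold[w][1] for w in words])

print(f"AUC={auc:.3f}  Ternary F1={ternary_f1:.3f}  tau={tau:.3f}")
```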