Author manuscript; available in PMC: 2023 Jul 1.
Published in final edited form as: J Labor Econ. 2022 May 20;40(3):613–667. doi: 10.1086/717730

Appendix Table A3:

Comparing Words from Alternative Corpora

| Stereotype | Complete Wikipedia: Word | Complete Wikipedia: CS Score | Spider-crawl Wikipedia: Word | Spider-crawl Wikipedia: CS Score | Strict string match: Word |
|---|---|---|---|---|---|
| Attractive | unattractive | 0.7458 | desirable | 0.6811 | attractive |
| | appealing | 0.6966 | advantageous | 0.6603 | |
| | desirable | 0.6878 | unattractive | 0.6250 | |
| | unappealing | 0.6699 | appealing | 0.6150 | |
| | alluring | 0.6612 | hospitable | 0.5845 | |
| | enticing | 0.6471 | palatable | 0.5710 | |
| | ideal | 0.6290 | favorable | 0.5578 | |
| | advantageous | 0.6268 | apt | 0.5564 | |
| | agreeable | 0.6207 | adaptable | 0.5564 | |
| | interesting | 0.6175 | comfortable | 0.5479 | |
| Hearing | hearings | 0.5888 | judgment | 0.4814 | hearing |
| | testifying | 0.5629 | deaf | 0.4631 | |
| | testimony | 0.5403 | hearings | 0.4614 | |
| | arraignment | 0.5272 | tinnitus | 0.4585 | |
| | pleading | 0.5148 | earplugs | 0.4549 | |
| | sentencing | 0.5078 | testifying | 0.4473 | |
| | testified | 0.5051 | cochlear | 0.4447 | |
| | committal | 0.4971 | proceedings | 0.4426 | |
| | questioning | 0.4929 | auditory | 0.4397 | |
| | complaint | 0.4853 | trial | 0.4374 | |
| Memory | memories | 0.6628 | memories | 0.7152 | memory |
| | brain | 0.5274 | amnesia | 0.5725 | |
| | recollection | 0.5185 | hippocampus | 0.5631 | |
| | cpu | 0.5138 | cognition | 0.5625 | |
| | remembering | 0.5051 | cognitive | 0.5617 | |
| | eidetic | 0.4975 | retrieval | 0.5589 | |
| | scratchpad | 0.4933 | recollection | 0.5514 | |
| | cache | 0.4873 | episodic | 0.5478 | |
| | rom | 0.4862 | perceptual | 0.5418 | |
| | consciousness | 0.4770 | reconsolidation | 0.5093 | |
| Physically able | unable | 0.6887 | unable | 0.6910 | physically able |
| | enough | 0.6050 | willing | 0.5993 | |
| | willing | 0.5912 | enough | 0.5800 | |
| | trying | 0.5904 | psychologically | 0.5680 | |
| | ability | 0.5865 | trying | 0.5513 | |
| | needed | 0.5696 | unwilling | 0.5365 | |
| | attempting | 0.5556 | expected | 0.5363 | |
| | allowed | 0.5549 | anxious | 0.5294 | |
| | psychologically | 0.5535 | attempting | 0.5281 | |
| | unwilling | 0.5505 | eager | 0.5271 | |
| Number of articles | approx. 5,500,000 | | 65,532 | | |
| Number of words | 885,424 | | 260,073 | | |

Note: The table presents examples of results from training models on different corpora. Complete Wikipedia includes all the articles; it has 885,424 words. Spider-crawl (scrapy spider) Wikipedia starts with articles referring to stereotypes, ageism, and labor markets, as explained in the text. Both models use 200-dimensional vectors. For each stereotype, the “Word” column lists the words with the highest similarity scores; the scores are reported in the “CS Score” column.
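
As an illustration of how a ranked list like the one in the table could be produced, the following minimal sketch assumes that "CS Score" denotes cosine similarity between word vectors. The function names (`cosine_similarity`, `most_similar`) and the 3-dimensional toy vectors are hypothetical and invented for illustration; they are not the trained 200-dimensional vectors behind the table, and this is not necessarily the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(target, vectors, top_n=10):
    """Return the top_n (word, score) pairs closest to `target`, best first."""
    target_vec = vectors[target]
    scores = [
        (word, cosine_similarity(target_vec, vec))
        for word, vec in vectors.items()
        if word != target  # skip the target word itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Toy vectors, invented for illustration only (assumption, not model output).
toy_vectors = {
    "memory": [1.0, 0.0, 0.0],
    "memories": [0.9, 0.1, 0.0],
    "cache": [0.5, 0.5, 0.0],
    "hearing": [0.0, 0.0, 1.0],
}

ranked = most_similar("memory", toy_vectors, top_n=2)
for word, score in ranked:
    print(f"{word}\t{score:.4f}")
```

With real embeddings, the same ranking step applied to each stereotype word would yield one ten-row block per corpus, which is why the same stereotype can have different nearest words under the Complete and Spider-crawl models.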