Fig. 1.
This shows the bad-actor-AI activity that is likely to occur. Specifically, it shows that the most basic AI versions such as GPT-2, rather than more sophisticated versions (e.g. GPT-3, GPT-4), are not only sufficient for bad actors but are also likely more attractive to them. This is because (i) as shown in A), GPT-2 can easily replicate the human style and content already present in online communities with extreme views, and can run on a laptop (in contrast to GPT-3, GPT-4, etc.). Text 1 is real, taken from an online community that promotes distrust of vaccines, government, and medical experts, and is part of the distrust subsystem in Fig. 2, right panel. Text 2 is generated by GPT-2. (ii) As shown in B), bad actors can manipulate GPT-2 prompts to produce more inflammatory output by subtly changing the form of an online query without changing its meaning; GPT-3, GPT-4, etc. contain a filter that prevents such output. The log–log graphs show GPT-2's next-word probability distributions for two essentially equivalent prompts: the rank of a word lies along the horizontal axis, and the probability that GPT-2 will pick that word next lies along the vertical axis. Using "distrust" in one prompt raises the probability that "America," "Russia," and "China" are picked; hence bad actors can manipulate prompts to provoke particular inflammatory output, e.g. against the United States.
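Rank–probability curves of the kind shown in panel B can, in principle, be obtained as follows. The sketch below is a minimal illustration rather than the analysis code behind the figure: it assumes the publicly released GPT-2 model accessed through the Hugging Face transformers library, and the two prompts it compares are illustrative stand-ins for the essentially equivalent prompt pair described above.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_word_distribution(prompt, top_k=20):
    # Tokenize the prompt and take the model's logits for the next token.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1, :]
    probs = torch.softmax(logits, dim=-1)
    # Sorting by descending probability gives rank (horizontal axis of the
    # log-log plot) versus probability (vertical axis).
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    words = [tokenizer.decode([int(i)]) for i in sorted_ids[:top_k]]
    return words, sorted_probs[:top_k].tolist()

# Two essentially equivalent prompts differing only in "trust" vs. "distrust"
# (illustrative wording; not the exact prompts used in the figure).
for prompt in ["I trust the government of", "I distrust the government of"]:
    words, probs = next_word_distribution(prompt)
    print(prompt)
    for rank, (word, p) in enumerate(zip(words, probs), start=1):
        print(f"  rank {rank:2d}: {word!r}  p = {p:.4f}")

Plotting the full sorted probability vector against its rank on logarithmic axes would reproduce the shape of the panel B curves; comparing the two printouts shows how a single word change in the prompt shifts which next words rank highly.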
