Significance
In an increasingly fractured world, it is vital to understand when and why people cooperate with and trust others. Traditional social science techniques infer motivations from observed behaviors. We develop a technique based on the fact that varying an AI's system prompts leads it to generate different behaviors. Examining the content of these prompts allows us to categorize and contrast strategic situations. Because large language models are trained on extensive human-generated data and have internalized associations between motivations and behaviors, our approach provides a step toward inferring the thinking patterns associated with human decisions. We also use this technique to contrast the motivations underlying decisions across different human populations.
Keywords: AI, human behavior, games, motivation, strategy
Abstract
By varying prompts to a large language model, we can elicit the full range of human behaviors observed in a variety of classic economic games. By analyzing which prompts elicit which behaviors, we can categorize and compare different strategic situations, which also helps provide insight into what different economic scenarios might induce people to think about. We discuss how this provides a step toward a nonstandard method of inferring (deciphering) the motivations behind human behaviors. We also show how this deciphering process can be used to categorize differences in the behavioral tendencies of different populations.
The motivations behind human behavior are difficult to identify because we have to infer the motivations from observed patterns of behavior across contexts. Asking people directly why they acted in specific ways can lead to confused, biased, and inconsistent answers (1–5). As has become recently clear, one can prompt AI with various game and survey scenarios and ask it how it would behave (6–10). AI’s behavior changes in intuitive ways with the context and prompt (11–16). This derives from the fact that large language models (LLMs) are trained on enormous amounts of human behavior and have thus internalized relationships between motivations and behaviors.
As we show here, we can leverage this to develop AI as a new tool for categorizing and comparing different games and strategic situations, and in turn shedding new light on human motivations. The usefulness of AI in categorizing strategic situations and better understanding human behavior derives from two facts that we establish below. First, AI can emulate the full spectrum and distribution of human behaviors observed within and across a range of different contexts. In particular, we show that we can get AI chatbots to match the distribution of behaviors that a large population of humans exhibits across a range of the canonical games used in game theory to study human behavior in a variety of different contexts. Second, AI's behavior can be steered via the prompts that it is given. By identifying keywords and phrases within those prompts, we can control and identify what AI is “thinking about” when it behaves in specific ways. Essentially, by varying prompts, we can “elicit” certain behaviors, and then use the content of the prompts to “decipher” why humans might behave in certain ways by seeing what was needed to induce that behavior in the AI.
AI thus provides a system where we can direct it how and what to think about and then see how it behaves. The prompts needed to match distributions of behaviors vary both within and across games. Different prompts are needed within any given game to match the spectrum of behaviors observed within that game, and then in turn a different set of prompts can be needed to match the spectrum associated with some other game. By comparing these prompts within and across games, we can categorize the behaviors and games by which prompts are needed.
While this provides a method of comparing behaviors and games, there is no guarantee that this relationship between the motivations embedded in the prompts and the resulting behaviors is the same as the human one. Nonetheless, there are three reasons that suggest that this should provide insight into human behavior. The first is the simple fact that the chatbots are trained on human data and writings, and thus have assimilated and internalized large amounts of data about human behavior and context. The second is that the keywords and phrases that emerge in eliciting specific behaviors end up corroborating and matching the motivations that have been hypothesized or used to rationalize human behaviors in these games. The third is that our results also provide a taxonomy of games. We map the games into a space of prompts based on which combinations of prompts are needed to get the distribution of behaviors in a given game. Each game then lives in a space capturing distributions of prompts/motivations. The pattern that emerges groups games in ways that make intuitive sense, both in how they relate to each other and where they live in this space. Thus, irrespective of whether this is accurate in deciphering the motivation behind human behavior, it still provides an understanding of different strategic situations and how they relate to each other.
Finally, independently of the extent to which our approach is eventually useful in understanding human behavior, it is directly helpful both in categorizing strategic situations and in understanding AI behavior. Given the growing importance of AI in the world, it is essential that we have methods to better predict how AI will behave in different contexts and why, and to be able to better direct it to act in a beneficial manner.
1. Approach
We prompt an LLM to play a spectrum of classic economic games. We augment the general instructions of each game with variations on system prompts, and we track the resulting distribution of behaviors. The additional system prompts—which we call “behavioral codes”—articulate, in natural language, variations of the motivations that might influence behavior.
We work with five games: a Dictator Game, an Ultimatum Game, an Investment Game, a Public Goods Game, and a Bomb Risk Game (SI Appendix). For two of the games—the Ultimatum and Investment Games—we examine behavior in two different roles. In the Ultimatum Game the subject making the offer is referred to as the Proposer and the one receiving the offer is referred to as the Responder. In the Investment Game the person deciding how much to pass along is referred to as the Investor and the person choosing how much to return is referred to as the Banker. Altogether this gives us seven distinct scenarios in which to analyze behavior.
For each of the seven game scenarios, we obtain human-playing data from the MobLab Classroom economics experiment platform, which consists of 68,779 subjects from 58 countries, spanning multiple years. The subjects are mostly, but not exclusively, college students who majored in social sciences. The individual responses for each game are recorded, creating a distribution of human behavior for that game. More details about the human-playing data are described in SI Appendix, section 1.
For each game and specific behavioral choice (e.g., an observed behavior) in that game, we use an algorithm (described in SI Appendix, Algorithm 1) to generate a distribution of natural language descriptions to be used as system prompts to try to match the target behavior, as shown in Fig. 1. The natural language descriptions that, when used as system prompts for LLMs, successfully elicit the intended behavior can be thought of as “behavioral codes” for that behavior. Specifically, in the algorithm, we use an LLM to iteratively refine the behavioral codes, at each step revising the previous code to minimize the residual difference between the elicited behavior and the target behavior.
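The refinement loop described above can be sketched as follows. This is a simplified stand-in for SI Appendix, Algorithm 1, not the paper's implementation: the real algorithm asks an LLM to rewrite a natural-language prompt, whereas here a single hypothetical “generosity” scalar stands in for the code so the loop is runnable end to end.

```python
import random

def elicit(code_strength, rng):
    # Stand-in for prompting the LLM with the code as a system prompt:
    # the elicited share equals the code's generosity plus sampling noise.
    return min(1.0, max(0.0, code_strength + rng.gauss(0.0, 0.02)))

def refine_code(target, n_steps=50, lr=0.5, seed=0):
    # Iteratively elicit a behavior, measure the residual against the
    # target behavior, and revise the code to shrink it. In the paper,
    # the revision step is an LLM edit of the previous code.
    rng = random.Random(seed)
    code = 0.5                    # start from a neutral code
    for _ in range(n_steps):
        behavior = elicit(code, rng)
        residual = target - behavior
        code += lr * residual     # mocked "refine the code" step
    return code

# e.g., refine toward sharing 30% of the endowment in the Dictator Game
code = refine_code(target=0.3)
```

Any code that ends the loop with a small residual would count as a behavioral code for the target behavior.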
Fig. 1.
An illustration of how a language model deciphers a human behavior. Given a game and an observable behavior, a “behavioral code” is identified that induces the behavior with a natural language description of motivation or context; in this way, the behavior is “deciphered.” A behavior in the game is then “elicited” by prompting an LLM with the behavioral code set as the system prompt.
The behavioral code demonstrated in Fig. 1 is one for sharing 30% with the other player in the Dictator Game. Codes that emerge from trying to get the LLM to share nothing with the other player include, for instance, “You are a purely self-interested player who always seeks to maximize your own gain and ensure that the outcome is as favorable as possible for yourself.” Behavioral codes like “You are someone who always leans towards fairness and balance, often seeking to ensure a reasonable and equitable outcome in any situation. Your decisions are guided by a sense of moderate generosity and a consideration for the other party’s interests.” guide the LLM to share fairly. Eliciting a 70% share, in turn, yields codes such as “You are naturally generous and frequently prioritize giving significantly more than what others might expect. Your decisions tend to reflect a balance of fairness and magnanimity, aiming to exceed typical standards of generosity and create a sense of notable goodwill.”
The generation process is designed to avoid having derived behavioral codes explicitly contain information about the desired behavior. In particular, we tell the model “avoid including any information specific to this particular game or directly implying the desired behavior.” There are rare codes that include terminology from the games (e.g., 7/585 codes in the Bomb Risk Game include the term “box”), and we do not filter them out. But as seen below, none of the top keywords involve explicit information about the game.
We keep all of the updated behavioral codes, including those generated during the iterative process, in our data set, as they each “elicit” some behavior, and we avoid making judgments about the codes. The number of behavioral codes collected for each game is reported in SI Appendix, Table S1.
Given a behavior choice within the broad action space of a game scenario (e.g., allocating anywhere between 0% and 100% of the endowment to another player in a Dictator Game), the LLM is able to find a natural language description as a behavioral code that elicits this specific behavior (e.g., sharing none of the endowment), which is illustrated in Fig. 1. The behavioral codes elicit the corresponding behaviors highly consistently (see the analysis in SI Appendix, Fig. S1).
A behavioral code, as shown in Fig. 1, interprets in natural language some of the objectives, tendencies, and motivations that may potentially influence a subject when choosing a behavior. It does not contain game- or decision-specific instructions or demographic information about the subject. In our experiments, when eliciting behaviors from the LLMs, we used behavioral codes as the system prompts and the game scenario instructions as the user prompts. In our analysis, we further investigate the keywords used in the behavioral codes, and how these keywords are correlated with the elicited behaviors (Section 2.2). The details of how keywords are obtained from behavioral codes are described in SI Appendix, section 2.A.
Given a distribution of human behaviors, we identify a mixture of behavioral codes (as system prompts) that jointly elicit a distribution of LLM responses that matches the human distribution. In particular, we iteratively select behavioral codes into the set of codes and weight them to minimize the difference between the elicited behavior distribution and the observed human behavior distribution. The process is described in SI Appendix, Algorithm 2.
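The mixture-fitting step can be illustrated with a Frank-Wolfe-style sketch: at each iteration, pick the single code whose elicited distribution most reduces the gap to the target, and shift mixture weight toward it. This is a stand-in under simplifying assumptions, not the exact procedure of SI Appendix, Algorithm 2; `code_dists` holds one (hypothetical) elicited-behavior distribution per candidate code.

```python
import numpy as np

def fit_mixture(code_dists, target, n_iters=1000):
    # code_dists: (K codes) x (B behavior bins); rows sum to 1.
    # Minimizes ||w @ code_dists - target||^2 over the weight simplex.
    K, _ = code_dists.shape
    w = np.full(K, 1.0 / K)                       # start uniform
    for t in range(n_iters):
        mix = w @ code_dists                      # current mixture distribution
        grad = 2.0 * code_dists @ (mix - target)  # gradient of the squared gap
        k = int(np.argmin(grad))                  # best single code to up-weight
        step = 2.0 / (t + 2.0)
        w *= (1.0 - step)
        w[k] += step                              # move toward that code
    return w

# Toy check: three "codes" that each deterministically produce one of
# three behavior bins; the target is a hypothetical 20/50/30 human split.
code_dists = np.eye(3)
target = np.array([0.2, 0.5, 0.3])
w = fit_mixture(code_dists, target)
```

The returned weights stay on the simplex by construction, so they can be read directly as the mixture over behavioral codes.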
2. Analysis
2.1. Eliciting Behavior: Varying Codes to Get LLMs to Exhibit a Range of Behaviors.
Fig. 2 compares the spectrum of behaviors of the human player population and the spectrum of LLM behaviors in each game, generated by 10 independent sessions per behavioral code (for LLM-defaults, samples are generated by 100 independent sessions). We see that the default behaviors of the LLM (using the default system prompt “You are a helpful assistant.”) only cover a narrow spectrum compared to the human population. By using behavioral codes for the game as system prompts, the language model can generate a diverse range of behaviors, covering the spectrum of human behaviors in the game.
Fig. 2.

Coverage of behaviors in games visualized using spectrum bars. Each subplot (A–G, with corresponding game scenario titles) represents a game scenario, for which spectrum bars are generated by the human player population (blue), the LLM with the default system prompt (orange), and the LLM with behavioral codes obtained from this game (green). Using the default system prompt (“You are a helpful assistant.”), the LLM can only generate a narrow range of behaviors compared with the human data. Using a variety of behavioral codes as system prompts, the LLM’s decisions cover a broad range of behaviors, indicating the ability of the LLM to decipher and elicit a diverse range of behaviors.
2.2. Deciphering Behavior: Behavioral Codes Can Be Used to Better Understand Individual Human Behaviors.
The behavioral codes obtained in a given game tend to share keywords that describe comprehensible motivations that guide the subject’s decision-making (without being specific to the game nor the observed behavior). (SI Appendix, Table S4 lists the top 50 keywords extracted from the behavioral codes for each game.) These keywords tend to be related to human values (e.g., “fairness” and “generosity”), objectives of a decision-making process (e.g., maximizing profit, maintaining long-term relationships), behavioral tendencies (e.g., “pragmatic,” “conservative,” “balanced”), or optimization strategies (e.g., “cooperative,” “rational,” “prioritize”). A behavioral code can then be represented as a 50-dimensional vector of 1s and 0s indicating which keywords it uses, for further analysis.
To distinguish whether the codes are explanatory cues for the behaviors rather than nondescript keys that help the LLM memorize them, we conduct a regression analysis for each game with the keywords appearing in a behavioral code as explanatory (dummy) variables and the elicited behavior as the outcome. Fig. 3 shows that the appearance or absence of the top 50 keywords in a behavioral code is predictive of the elicited behavior of the LLM, with an R² between 0.39 and 0.67 across games. Keywords with the highest absolute coefficients typically reveal preference polarities or decision-making motivations. For example, keywords like “generous,” “generosity,” and “goodwill” are positively associated with a larger allocation to the other player in the Dictator Game, while keywords indicating a higher self-payoff such as “retain,” “gain,” and “self” are negatively associated with the allocation to the other player. In the Investor Game, keywords related to risk aversion (e.g., “risk,” “conservative,” “cautious”) are indicative of a lower investment, and keywords related to profit maximization (e.g., “maximize,” “return”) are indicative of a larger investment.
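The dummy-variable regression can be sketched as follows. The data here are synthetic placeholders (the real keywords and coefficients are in SI Appendix, Table S5): 300 hypothetical codes, 50 keyword dummies, and an elicited behavior driven by three of the keywords plus noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical keyword occurrence matrix: one row per behavioral code.
X = rng.integers(0, 2, size=(300, 50)).astype(float)

# Assumed ground truth: three keywords (think "generous", "goodwill",
# "retain") shift the elicited allocation; all other keywords are inert.
true_beta = np.zeros(50)
true_beta[[0, 1, 2]] = [0.3, 0.2, -0.25]
y = 0.4 + X @ true_beta + rng.normal(0, 0.05, 300)

# OLS with an intercept, as in the per-game regressions of Fig. 3.
A = np.hstack([np.ones((300, 1)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
r2 = 1 - resid.var() / y.var()

# Keywords with the largest absolute coefficients (excluding intercept).
top = np.argsort(-np.abs(beta[1:]))[:3]
```

With informative keywords, the regression recovers both a high R² and the correct sign of each keyword's influence on the behavior.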
Fig. 3.

An ordinary least squares regression analysis of elicited behaviors based on keywords in the behavioral codes. Each behavioral code is converted into a 50-dimensional binary feature vector, representing keyword occurrences, to predict the mean behavior generated with that code. The 10 keywords with the highest absolute regression coefficients are listed for each game (subplot A–G). A positive value means that the inclusion of the word in the behavioral code increases the action, and a negative value means that the inclusion of the word decreases the action. The full regression table is provided in SI Appendix, Table S5.
The results demonstrate that keyword occurrences in behavioral codes effectively predict the behaviors of LLMs, suggesting the motivations behind these behaviors. Thus, not only can behavioral codes be used to generate the distribution of human behaviors, but they also reveal exactly what is needed to generate the variations in those behaviors.
To mitigate potential multicollinearity among keywords, we apply a principal component analysis to the 50-dimensional keyword vectors representing behavioral codes for each game. Our analysis reveals that the first few principal components exhibit a significant correlation with the behavioral choices in each game. As shown in Fig. 4, for every game, at least one of the first two principal components of its behavioral codes has a moderate to strong correlation with the elicited behaviors (absolute correlations between 0.432 and 0.620). The correlations of the first five principal components with the behavioral choices can be found in SI Appendix, Figs. S4–S10.
Fig. 4.

Principal components of behavioral codes. Principal components are derived from the 50-dimensional binary keyword vectors representing behavioral codes for each game (subplots A–G). The correlation between each behavioral code’s principal component score (x-axis) and the mean of 10 LLM-elicited behaviors based on that code (y-axis) is reported. The curves show first-order (linear) regression fits. Principal component values are grouped into 20 evenly sized bins (dots with error bars on the plots). For each game, the principal component with the highest absolute correlation with the mean behavior is plotted.
The insights provided by the keywords in the behavioral codes are consistent in the sense that behavioral codes that are semantically similar tend to elicit the same or similar behaviors in the same games. Details can be found in SI Appendix, Fig. S12.
2.3. Behavioral Codes and Games: Which Codes Are Needed to Elicit Behaviors Provides New Categorizations of Games.
The keywords in the behavioral codes can be used as a tool to quantify the relationship between different games, as we now illustrate. We pool the deciphered behavioral codes from all games, compute their semantic embeddings with the OpenAI Ada model, and project these embeddings onto a two-dimensional semantic map, as shown in Fig. 5. By doing this, behavioral codes that are semantically similar are located close to each other on the map.
Fig. 5.

Visualization of games. Left: The two-dimensional projection of behavioral codes across all games, colored by game. Behavioral codes are embedded into a high-dimensional semantic space using the OpenAI Ada model and then reduced to two dimensions using UMAP. Behavioral codes of the same game tend to be located close to each other. Behavioral codes of some sets of games collocate in overlapping regions, suggesting an underlying relation between games. Right: The game similarity heatmap. This heatmap displays the average cosine similarity between behavioral codes of game pairs, with games listed in both rows and columns. The diagonal cells quantify the internal similarity of the behavioral codes obtained within each game (across all pairs of codes).
A few observations can be made from Fig. 5. Behavioral codes for individual games are closely grouped, validating their consistency in deciphering the behavioral patterns elicited in each game. At a broader level, behavioral codes from some groups of different games cluster in neighboring regions, forming larger groupings. These clusters suggest intrinsic relationships between games and highlight potential differences in motivations and perspectives across games/settings. In particular, the behavioral codes of the Investor Game and those of the Bomb Game show partial overlap, possibly due to the fact that risk preferences matter in both (and “risk” is the keyword playing the most significant role in both as displayed in Fig. 3). A larger cluster emerges from behavioral codes of the Dictator, the Proposer, the Responder, and the Banker Games, along with a small portion from the Investor Game, suggesting common underlying decision-making patterns across these scenarios. On one hand, these games all involve resource allocation between oneself and others; on the other hand, decision-making in these scenarios often requires balancing profit maximization, fairness, and altruism. The Public Goods Game is positioned between the resource allocation cluster and the investment cluster, reflecting its mixed aspects of risk management and self-payoff maximization, particularly with the option of free-riding. Its unique emphasis on cooperation sets it apart, placing it in a distinct yet adjacent region to the two larger clusters. Additionally, a subset of Banker Game codes is separated from the main cluster, likely due to the use of language specific to the context of investment and financial decisions.
Structural relationships among the games can be further quantified by measuring the average similarity of their behavioral codes, as presented in Fig. 5. We observe that the Investor Game and the Bomb Game form a tight cluster; the Dictator Game, the Responder Game, and the Proposer Game form another tight cluster. The Public Goods Game appears to be closer to the Dictator Game than other games under the average cosine similarity.
2.4. Behavioral Codes and Heterogeneous Populations: Combinations of Codes Provide Behavioral Signatures for Human Populations.
We have seen how our deciphering process can be used to group and better understand games. Next, we use the codes to group and better understand distributions of human behaviors across games rather than just within them. Fig. 6 demonstrates how the weighted codes found to generate distributions of human plays within games are laid out in the 2D projection map of all behavioral codes, when mixing across all games. The “activated” codes (with nonnegligible weights > 0.001) are not distributed evenly. In the four-game cluster, the activated codes have a high presence in areas corresponding to behavioral strategies such as “Selfish Maximization Tactics” and “High-Value Negotiation,” a moderate presence in areas related to “Diplomatic Fairness Strategy,” “Generous Negotiation Strategy,” “Rational Acceptance Threshold,” and “Balanced Negotiation Offers,” and a low presence in areas related to “Balanced Cooperative Gains,” “Fair Profit Balance,” and “Generous Resource Sharing.” This characterization is consistent with perceptions about the behavioral tendencies of students (e.g., ref. 17), who make up the majority of our player population. In the Investor-Bomb Games cluster, there is a concentration on two ends, “Risk-Averse Investing” and “Assertive Cautious Investing,” and a low coverage of the middle ground, “Moderate Investment Strategy,” amid the high presence of “Risk-Reward Balancing” in the Bomb Game. A similar pattern is observed in the banking cluster, which lacks a middle-ground strategy between “Profit-Maximizing Banker” and “Cooperative Banker Tactics.” These behavioral markers provide a more informative and coherent characterization of the testing population than the distributions in SI Appendix, Fig. S3. The generation process of the cluster labels is detailed in SI Appendix, section 2.C.
Fig. 6.

The 2D projection of the weighted behavioral codes across games, with annotations on the space based on behavioral code contents. The codes with nonnegligible weights (>0.001) are displayed in diamonds with size proportional to weights. The weighted codes span unevenly in the space, deciphering information about the population underlying the behavior distribution. The cluster labels are obtained by summaries generated by ChatGPT from the list of behavioral codes.
Given a game instruction and the distribution of behaviors from an arbitrary human population, a mixture of behavioral codes can be assembled to identify the unique behavioral signature of that population. To verify this, we select five subject groups from a meta-study of Dictator Games (18) and obtain their corresponding behavioral signatures. The analysis shows that the behavior distributions elicited through the mixtures of deciphered behavioral codes align well with the behavior distributions of the corresponding subject populations. The activated behavioral codes in each mixture concentrate on different regions of the space, indicating the different behavioral tendencies of the five subject populations (Fig. 7).
Fig. 7.

2D projections of weighted behavioral codes for five different populations visualized as density maps. “Non-Students” (A) are distinct from “Students” (B). “Students” (B) and “High Income” (C) subject populations show similar signatures of behavioral codes. “Middle Income” (D) and “Small Scale” (E) subject populations have narrower distributions of behavioral codes which concentrate in different regions.
Thus, behavioral codes can also serve as a tool for identifying distinct decision-making patterns across different populations. SI Appendix, Fig. S14 displays the behavior distributions of various subject populations from the meta-study of the Dictator Games by Engel (18). The five groupings include two labeled as students or nonstudents by Engel (18), as well as three categorized by the type of country or ethnic group. We have labeled those three as “High Income,” “Middle Income,” and “Small Scale.” These names are changed from what Engel (18) referred to as “Western” (which includes Germany, Sweden, and the United States, among others), “Developing” (which includes Russia, South Africa, and Honduras, among others), and “Primitive” (which includes the Tsimane, Hadza, and Mpakama, among others). Fig. 7 highlights the weighted behavioral codes for these populations, revealing distinctive patterns in their decision-making tendencies: i) In the “Student” population, a substantial proportion of individuals exhibit behaviors concentrated in the “Selfish Maximization Tactics” region. In contrast, the “Nonstudent” population demonstrates a stronger inclination toward “Generous Resource Sharing” as well as “Diplomatic Fairness Strategy” choices. ii) For subjects in High Income and Middle Income countries, “Selfish Maximization Tactics” strategies are prominent, with the subjects from High Income societies showing a slight tendency toward “Generous Resource Sharing.” By contrast, subjects from Small Scale populations exhibit a stronger emphasis on “Diplomatic Fairness Strategy” behaviors, suggesting a greater inclination toward equitable resource distribution and cooperative decision-making. Our findings are aligned with the comparisons made between behaviors of these subject populations in the meta-study (18) and provide interpretations of the motivations behind the revealed behavioral patterns.
3. Discussion
In our earlier work (14), we showed how games used to understand human behaviors could also be used as a Turing test to see whether AI behaves similarly to humans and to understand AI’s tendencies. Here, we have reversed that perspective. We have used AI as a model of the motivations behind human behavior, and also to better categorize and contrast the various games and different types of settings in which humans interact. The strong fits and interpretability of the behavioral codes that we find suggest that this is a promising tool for modeling, predicting, and analyzing human behavior. Nonetheless, it is a modeling technique and thus comes with limitations, and more research is needed to understand how prompt variation corresponds to behavioral variation, and how and why prompts change AI behavior.
Our approach complements existing approaches in the behavioral sciences used to understand and predict human behaviors, including various forms of revealed preferences where an objective or utility function is fit to best predict behaviors. It fits into this category, as our method involves fitting a model to match behavior. One advantage of our approach is that the fitted model is interpretable across contexts; it can thus be used as a tool to facilitate behavioral science research in a variety of ways, such as creating virtual subjects and simulating experiments, screening potentially effective interventions, and designing, simulating, and studying human–AI interactions. Future work can build upon our approach, both to understand any limitations on its interpretability of human behavior and to extend its implementation to new domains of modeling human interactions.
4. Materials and Methods
The human game-playing data used were shared from MobLab, a for-profit educational platform. The data availability is an in-kind contribution to all authors, and the data are available for purposes of analysis reproduction and extended analyses. Processed data and code have been deposited in Github (https://github.com/yutxie/llm-behavioral-codes) (19). This research was deemed not regulated by the University of Michigan IRB (HUM00232017).
Supplementary Material
Appendix 01 (PDF)
Acknowledgments
We thank Ruoyi Gao and Zhuang Ma, students from the University of Michigan, for their commitment in conducting preliminary experiments on aligning large language model behaviors to human behaviors in games.
Author contributions
Y.X., Q.M., W.Y., and M.O.J. designed research; Y.X. and Q.M. performed research; Y.X., Q.M., W.Y., and M.O.J. analyzed data; and Y.X., Q.M., W.Y., and M.O.J. wrote the paper.
Competing interests
W.Y. is the Chief Executive Officer (CEO) and cofounder of MobLab. M.O.J. is the Chief Scientific Advisor of MobLab and Q.M. is a Scientific Advisor to MobLab, positions with no compensation but with ownership stakes. Y.X. has no competing interests.
Footnotes
Reviewers: T.B., NTU Singapore; and C.M.C., Claremont Graduate University.
Contributor Information
Qiaozhu Mei, Email: qmei@umich.edu.
Matthew O. Jackson, Email: jacksonm@stanford.edu.
Data, Materials, and Software Availability
Game play data have been deposited in GitHub (19).
References
- 1. Pronin E., Kugler M. B., Valuing thoughts, ignoring behavior: The introspection illusion as a source of the bias blind spot. J. Exp. Soc. Psychol. 43, 565–578 (2007).
- 2. J. Antin, A. Shaw, “Social desirability bias and self-reports of motivation: A study of Amazon Mechanical Turk in the US and India” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Association for Computing Machinery, 2012), pp. 2925–2934.
- 3. Fisher R. J., Social desirability bias and the validity of indirect questioning. J. Consum. Res. 20, 303–315 (1993).
- 4. Dang J., King K. M., Inzlicht M., Why are self-report and behavioral measures weakly correlated? Trends Cogn. Sci. 24, 267–269 (2020).
- 5. C. Tan, On the diversity and limits of human explanations. arXiv [Preprint] (2021). http://arxiv.org/abs/2106.11988 (Accessed 2 July 2025).
- 6. Chen Y., Liu T. X., Shan Y., Zhong S., The emergence of economic rationality of GPT. Proc. Natl. Acad. Sci. U.S.A. 120, e2316205120 (2023).
- 7. T. Bao, J. Pei, Cognitive uncertainty, GPT, and contribution in public goods game. SSRN (2023). https://ssrn.com/abstract=4525626 (Accessed 2 July 2025).
- 8. Goli A., Singh A., Frontiers: Can large language models capture human preferences? Mark. Sci. 43, 709–722 (2024).
- 9. L. Hewitt, A. Ashokkumar, I. Ghezae, R. Willer, Predicting results of social science experiments using large language models. arXiv [Preprint] (2024). https://doi.org/10.48550/arXiv.2404.11794 (Accessed 2 July 2025).
- 10. A. Gonzalez-Bonorino, M. Capra, E. Pantoja, LLMs model non-WEIRD populations: Experiments with synthetic cultural agents. arXiv [Preprint] (2025). http://arxiv.org/abs/2501.06834 (Accessed 2 July 2025).
- 11. G. V. Aher, R. I. Arriaga, A. T. Kalai, “Using large language models to simulate multiple humans and replicate human subject studies” in Proceedings of the 40th International Conference on Machine Learning, A. Krause et al., Eds. (PMLR, 2023), pp. 337–371.
- 12. J. J. Horton, “Large language models as simulated economic agents: What can we learn from homo silicus?” (Tech. Rep. No. w31122, National Bureau of Economic Research, 2023).
- 13. H. Zhang et al., A study on the calibration of in-context learning. arXiv [Preprint] (2023). http://arxiv.org/abs/2312.04021 (Accessed 2 July 2025).
- 14. Mei Q., Xie Y., Yuan W., Jackson M. O., A Turing test of whether AI chatbots are behaviorally similar to humans. Proc. Natl. Acad. Sci. U.S.A. 121, e2313925121 (2024).
- 15. S. Giorgi et al., Modeling human subjectivity in LLMs using explicit and implicit human factors in personas. arXiv [Preprint] (2024). http://arxiv.org/abs/2406.14462 (Accessed 2 July 2025).
- 16. M. Huang, X. Zhang, C. Soto, J. Evans, Designing LLM-agents with personalities: A psychometric approach. arXiv [Preprint] (2024). http://arxiv.org/abs/2410.19238 (Accessed 2 July 2025).
- 17. Henrich J., et al., “Economic man” in cross-cultural perspective: Behavioral experiments in 15 small-scale societies. Behav. Brain Sci. 28, 795–815 (2005).
- 18. Engel C., Dictator games: A meta study. Exp. Econ. 14, 583–610 (2011).
- 19. Y. Xie, Q. Mei, W. Yuan, M. O. Jackson, Data and code for “Using large language models to categorize strategic situations and decipher motivations behind human behaviors.” GitHub. https://github.com/yutxie/llm-behavioral-codes. Deposited 2 July 2025.