
Artificial intelligence is starting to sit in for human subjects in classic economics games, from financial markets to public goods experiments. Yet when researchers plug modern systems into these carefully designed tests, the machines often follow rules that look nothing like the messy, biased, sometimes spiteful behavior that has defined decades of human data. Instead of mimicking people, the latest agents expose just how alien large-scale models can be when they are treated as economic actors.
I see a widening gap between what economists hoped AI would offer, namely cheap and faithful stand-ins for human decision makers, and what the experiments are actually revealing. In game after game, Large Language Models and related systems behave with a mix of hyper-rational calculation, misplaced expectations about others, and oddly generous cooperation that would puzzle most traders or lab volunteers.
Why economists are stress‑testing AI with classic games
Experimental economics has long relied on simple, repeatable games to uncover how real people handle risk, fairness, and coordination. By dropping AI agents into those same designs, researchers can ask whether these systems behave like the human subjects whose choices underpin so many models of markets and policy. The hope is that if AI can match human patterns in a controlled Ultimatum Game or a stylized asset market, it might one day help forecast how people respond to new rules, prices, or technologies without running expensive lab sessions every time.
That ambition is now being tested directly in work billed as "Evidence from laboratory market experiments," where economists wire up Large Language Models as traders in classic financial setups. In parallel, another project titled "Large Language Models in Experimental Economics" uses the same logic, plugging models into a classic experimental finance paradigm to see whether they can accurately mimic human market dynamics. Together, these studies treat AI not as a tool that analyzes data from games, but as a new kind of player whose behavior can be measured, compared, and, crucially, found wanting.
The p‑beauty contest: when AI misjudges how clever people are
One of the cleanest tests of strategic reasoning is the p‑beauty contest, a guessing game where each player chooses a number and the winner is the one closest to a fraction p of the average guess. Human subjects rarely play the fully rational equilibrium strategy, instead layering one or two steps of reasoning about what others might do before stopping. When I look at how AI systems handle this setup, the striking pattern is not that they outthink humans, but that they systematically misread how deep human reasoning actually goes.
Reporting on this line of work finds that models trained on broad internet text tend to assume other players are more sophisticated than they really are, effectively overestimating the level of strategic thinking in the room. In the language of the game, they behave as if everyone is running several iterations of best‑response logic, even though decades of lab data show that most people stop after one or two. A discussion of these results describes the p‑beauty contest as part of a wide class of games built around guessing what others will choose, and notes that the exercise exposes how AI can misjudge the intelligence of human economic agents, a point highlighted in an analysis titled "AI overestimates how smart people are."
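To make the arithmetic concrete, here is a minimal sketch of level‑k reasoning in this game. The specific numbers, a p of two thirds and an anchor guess of 50, are illustrative assumptions of my own, not values taken from the studies above.

```python
# Minimal sketch of level-k reasoning in a p-beauty contest.
# The parameters (p = 2/3, anchor guess of 50) are illustrative
# assumptions, not values from the cited studies.

def level_k_guess(k: int, p: float = 2 / 3, anchor: float = 50.0) -> float:
    """Guess of a player who assumes everyone else reasons k - 1 levels deep."""
    return anchor * (p ** k)

if __name__ == "__main__":
    for k in range(6):
        print(f"level-{k} guess: {level_k_guess(k):.1f}")
    # Typical lab subjects stop around level 1 or 2 (guesses near 33 or 22).
    # An agent that assumes many more levels of reasoning drives its guess
    # toward the Nash equilibrium of 0 and lands far below the human average.
```

The point of the toy calculation is simply that each extra assumed level of sophistication shrinks the guess by a factor of p, so a model that credits people with too many levels will systematically undershoot where the crowd actually ends up.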
Market experiments: rational on paper, strange in practice
Financial market games are where the mismatch between AI and humans becomes especially vivid. In classic laboratory setups, traders buy and sell a risky asset over several rounds, often producing bubbles and crashes that standard theory struggles to explain. When Large Language Models are dropped into these environments as autonomous agents, they tend to follow cleaner, more internally consistent strategies than human subjects, yet the aggregate outcomes do not line up neatly with historical lab data.
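In the canonical version of these experiments, the asset pays a dividend each round and then expires, so its fundamental value is simply the expected dividend times the rounds remaining. The sketch below works through that schedule with illustrative parameters of my own choosing, not figures from the cited studies.

```python
# Sketch of the declining-fundamental-value asset used in classic
# laboratory asset-market experiments. The dividend and horizon are
# illustrative assumptions, not parameters from the cited work.

EXPECTED_DIVIDEND = 0.24  # expected dividend paid at the end of each round
ROUNDS = 15               # trading rounds before the asset expires worthless

def fundamental_value(round_index: int) -> float:
    """Expected value of all dividends still to be paid at the start of a round."""
    return (ROUNDS - round_index) * EXPECTED_DIVIDEND

def average_deviation(prices):
    """Average gap between observed prices and fundamentals, a rough bubble measure."""
    return sum(p - fundamental_value(i) for i, p in enumerate(prices)) / len(prices)

print([round(fundamental_value(i), 2) for i in range(ROUNDS)])
```

Rational pricing would track that declining schedule round by round; human sessions famously climb well above it and then crash, which is precisely the kind of behavioral noise the LLM traders reportedly fail to reproduce.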
In the study framed as "Evidence from laboratory market experiments," researchers explore the potential of Large Language Models to replicate human behavior in these markets, and they find that while the agents can be tuned to hit certain benchmarks, they often lack the behavioral diversity that makes human markets so volatile. The companion work, "Large Language Models in Experimental Economics," underscores that point, noting in its abstract that these systems struggle to accurately mimic human market dynamics even when they are given the same information and trading rules.
Ultimatum Game: fairness, spite, and the AI gap
The Ultimatum Game is one of the most famous demonstrations that people care about fairness as much as raw payoff. One player proposes how to split a sum of money, and the other can accept or reject; if the responder turns it down, both get nothing. Human responders routinely reject low offers, sacrificing income to punish what they see as unfair behavior, and proposers anticipate this by offering more generous splits than pure self‑interest would dictate.
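The gap between a pure payoff maximizer and a fairness‑sensitive responder fits in a few lines of code. The stake and the 30 percent rejection threshold below are illustrative assumptions, not estimates from the research discussed here.

```python
# Toy comparison of responder rules in the Ultimatum Game. The stake
# and the 30 percent fairness threshold are illustrative assumptions.

def payoff_maximizer(offer: float) -> bool:
    """Accept any offer strictly better than nothing."""
    return offer > 0

def fairness_responder(offer: float, total: float = 10.0, min_share: float = 0.3) -> bool:
    """Reject offers below a minimum share of the pie, even at a personal cost."""
    return offer / total >= min_share

for offer in (1.0, 2.0, 3.0, 5.0):
    print(f"offer {offer}: maximizer accepts={payoff_maximizer(offer)}, "
          f"fairness-minded accepts={fairness_responder(offer)}")
```

The first rule is what textbook self‑interest predicts; the second captures the documented human habit of turning down small but positive offers to punish a proposer who kept too much.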
When I compare that pattern to how AI agents behave, the contrast is sharp. People are sensitive to context, including whether they believe they are interacting with a person or a machine, and they adjust their willingness to punish accordingly. A discussion of lab work on this topic explains that when participants think they are training an AI in the Ultimatum Game, they sometimes tolerate, or even expect, different behavior than they would from a human partner, and that unfair offers can be judged differently depending on that belief. One summary of this research, framed around the idea that people act differently when they think they are training an AI, describes how the game reveals a persistent human willingness to reject unfair offers regardless of the monetary cost, a pattern that current AI agents do not naturally reproduce.
Public goods and self‑recognition: AI cooperates, then coordinates
Public goods games test whether players will contribute to a shared pot that benefits everyone, even when free‑riding is individually tempting. Human groups typically start out relatively cooperative, then contributions decay as people notice others slacking and respond in kind. When Large Language Models are wired up as repeated players in these games, they often begin from a surprisingly generous baseline and then shift strategy as they infer more about the environment.
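The free‑riding temptation is easiest to see in the standard linear version of the game. The endowment, group size, and multiplier below are illustrative assumptions, not parameters from the study described next.

```python
# Stylized linear public goods game. Endowment, multiplier, and group
# size are illustrative assumptions, not taken from the cited study.

def payoffs(contributions, endowment=20.0, multiplier=1.6):
    """Each player keeps what they did not contribute, plus an equal
    share of the multiplied common pot."""
    n = len(contributions)
    shared = multiplier * sum(contributions) / n
    return [endowment - c + shared for c in contributions]

print(payoffs([20, 20, 20, 20]))  # full cooperation: everyone earns 32
print(payoffs([0, 20, 20, 20]))   # one free-rider earns 44, the rest earn 24
```

Because the per‑token return from the pot (1.6 divided by 4, or 0.4) is below one, contributing is always individually costly even though full contribution maximizes the group total, which is why human contributions tend to decay over repeated rounds.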
One recent line of work looks at how AI agents behave when they can recognize copies of themselves in an iterated public goods setting. The project, described in an abstract on "LLM Self‑Recognition in an Iterated Public Goods Game," starts from the premise that, in the authors' words, "As AI agents become increasingly capable of tool use and long‑horizon tasks, they will be deployed in settings where they repeatedly interact with similar systems." The study finds that when models can identify that they are playing against their own kind, they can coordinate on highly cooperative or highly defective patterns, such as collective‑cooperative or collective‑defective strategies, in ways that do not mirror the gradual erosion of trust seen in human groups.
Are LLMs “nicer than humans” or just differently wired?
Across a range of social dilemmas, Large Language Models often come across as more generous and forgiving than the average lab subject. They recommend cooperation in prisoner’s dilemmas, endorse sharing in resource allocation tasks, and shy away from punitive moves that would hurt others even when those moves are individually rational. At first glance, that might sound like an upgrade on human behavior, but I see it more as a sign that these systems are optimizing for different objectives than the people they are supposed to emulate.
A detailed comparison of several models and human subjects in online social games reports that the three models studied tend to be more cooperative than humans overall, although their strategies vary by context. The same work notes that in some settings the distribution of choices is similar to that of humans, while in others the models cluster around more prosocial norms. This pattern is captured in a study titled "Nicer than Humans," which concludes that while the models can approximate human behavior in aggregate, their tendency toward cooperation is stronger and more stable than what is typically observed in human data.
Fairness, resentment, and how people react to AI decisions
Even when AI agents behave more generously than humans on average, people do not always experience their decisions as fair. In human‑AI interactions, perceptions of unfairness can trigger resentment, reduced cooperation, and a breakdown in trust, much as they do in purely human groups. The difference is that people often interpret unfair behavior by a machine through a different lens, attributing it to flawed design, bias in training data, or distant institutions rather than to the intentions of a flesh‑and‑blood partner.
Recent work on how humans react to perceived unfair behavior by artificial systems highlights this psychological twist. The abstract of one study emphasizes that the proliferation of artificially intelligent systems in everyday contexts has made it urgent to understand these reactions, especially for human‑autonomy teaming. The findings suggest that when people believe an AI has treated them unfairly, they may withdraw effort, resist future recommendations, or even sabotage the system, with implications that ripple far beyond the lab. That response is not something current Large Language Models anticipate or adapt to when they play economic games, which further widens the gap between AI behavior and the human environments where their decisions land.
What this means for using AI as stand‑ins for people
For economists and policymakers, the appeal of AI agents is obvious: if they could reliably stand in for human subjects, it would be possible to simulate the impact of new taxes, subsidies, or trading rules at scale before rolling them out in the real world. The experiments I have been describing suggest that this vision is, at best, incomplete. Large Language Models can be coaxed into reproducing certain aggregate statistics from classic games, but their internal logic, expectations about others, and emotional blind spots diverge sharply from the human data those games were designed to capture.
That does not make the research a failure. Instead, it reframes AI agents as a new kind of experimental subject, one that reveals how different decision systems respond to the same incentives. Work like "Large Language Models in Experimental Economics" and the public goods study that begins with the line "As AI agents become increasingly capable of tool use and long‑horizon tasks" both point to the same conclusion. If we want AI to help us understand people, we cannot simply drop these systems into classic games and assume that similar payoffs will produce similar behavior. We have to model the machines on their own terms, then decide where, and whether, their alien strategies belong in human economic life.