Power users of AI agents are caught in a feedback loop that looks less like productivity and more like gambling. When an AI agent delivers inconsistent results across identical prompts, users do not walk away. They retry, rephrase, and pull the lever again, chasing the output that finally clicks. Peer-reviewed research now explains why: the way large language models handle and express uncertainty triggers psychological patterns that keep users engaged well past the point of diminishing returns. The phenomenon matters because AI agents are rapidly entering high-stakes workflows in finance, research, and operations, where compulsive retrying could distort decisions rather than improve them.
What is verified so far
Three distinct lines of academic research converge on the same finding: the unpredictability of AI agent outputs is not a bug users can ignore but a behavioral trap with measurable effects on trust, satisfaction, and decision-making.
A peer-reviewed paper in Psychological Science in the Public Interest examines how humans interpret the uncertainty baked into LLM outputs. The authors show that AI-generated text can be psychologically “sticky,” meaning users tend to over-interpret variable responses as meaningful signals rather than treating them as probabilistic guesses. This stickiness arises from the gap between how models actually handle uncertainty and how people perceive that uncertainty. Users often anthropomorphize model variability, reading intention or hidden knowledge into what is, at its core, statistical noise.
Experimental evidence from the International Journal of Human-Computer Studies adds a behavioral dimension. In controlled tasks, researchers tested how different forms and levels of uncertainty language from LLMs affect user behavior. When a model hedges with phrases like “I’m not sure,” participants report lower trust in that specific answer but paradoxically show higher retry rates. That pattern (reduced confidence in any single result driving more attempts) maps directly onto variable-reward dynamics familiar from behavioral psychology. Users keep pulling the lever precisely because outcomes vary and occasional “jackpot” answers feel especially rewarding.
On the engineering side, a technical preprint on arXiv tackles the problem from within the system itself. The work proposes a method for quantifying and propagating uncertainty across multi-step agent workflows, arguing that without such mechanisms, agent runs remain inconsistent in ways that are difficult for users or developers to predict. The authors frame unpredictability as an inherent systems problem in LLM-based agents, not merely a matter of user misperception. Random failures compound across steps, and current agent architectures lack robust ways to flag when a chain of actions is especially unreliable.
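The compounding problem is easy to see with a back-of-the-envelope calculation. The Python sketch below is purely illustrative and does not reproduce the preprint's framework; it assumes each step in an agent workflow succeeds independently with a known probability, which real agents do not guarantee and rarely report.

```python
# Illustrative sketch: how per-step reliability erodes across a multi-step agent run.
# Assumes independent steps with known success probabilities, a simplification
# rather than the method proposed in the preprint.

def run_reliability(step_success_probs):
    """Probability that every step in a chain succeeds, given per-step estimates."""
    total = 1.0
    for p in step_success_probs:
        total *= p
    return total

# Five steps that each look dependable in isolation (90% reliable each).
steps = [0.9, 0.9, 0.9, 0.9, 0.9]
print(f"Whole-run reliability: {run_reliability(steps):.2f}")  # prints 0.59
```

Even when every individual step looks dependable, the chance that an entire run comes out clean can drop sharply as the workflow grows longer, which is exactly the kind of end-to-end erosion that uncertainty-propagation methods aim to make visible.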
Taken together, these strands of evidence support a specific claim: LLM agents combine internal randomness, opaque uncertainty, and persuasive language in ways that can nudge users into repeated querying and over-interpretation. That pattern looks less like rational sampling of a probabilistic tool and more like a slot machine that occasionally pays out a perfect answer.
What remains uncertain
Several important questions remain open, and the available evidence does not yet resolve them.
First, the degree to which the slot-machine dynamic leads to measurably worse decisions in professional settings is not established by current studies. The psychological work explains the mechanism, and the HCI experiment documents the retry behavior, but neither tracks long-term outcomes for power users who rely on AI agents daily. Whether compulsive retrying degrades expertise over time, or instead helps users triangulate better answers, has not been tested in longitudinal research that follows real teams over months or years.
Second, the research corpus contains no direct statements from major AI developers on how they design uncertainty handling in their agents. The arXiv preprint proposes a technical solution for propagating uncertainty, but it is unclear whether any commercial platform has adopted similar methods or run large-scale evaluations. The gap between academic proposals and deployed products is significant. Without concrete disclosures from vendors about sampling settings, uncertainty calibration, and guardrails, users cannot know how much randomness in their results is addressable versus structural to current architectures.
Third, the experimental HCI evidence measures user responses to verbalized uncertainty under specific lab conditions. The available summaries do not fully detail recruitment methods, sample sizes, or effect magnitudes. That makes it difficult to judge how robust the retry effect is across different populations and task types. Readers should treat these behavioral findings as directional rather than definitive until the full methodology and data can be independently scrutinized.
Finally, it is unclear whether the retry behavior is uniform across user types. Power users (those who build workflows around agents, chain tools, and run dozens of prompts per day) may respond differently from casual users who only ask occasional questions. Expertise, domain knowledge, and familiarity with probabilistic reasoning could all moderate the effect. For example, a statistician might treat inconsistent outputs as expected sampling noise, while a non-technical professional might infer hidden insight from each variation. Current research does not segment results in a way that cleanly answers these questions.
How to read the evidence
The three sources supporting the slot-machine framing differ in type and strength, and that distinction matters for anyone trying to assess the claim.
The psychological paper in Psychological Science in the Public Interest is peer-reviewed work in a well-established journal. It provides the theoretical backbone: an explanation of why LLM uncertainty is psychologically sticky and how human metacognition interacts with model-generated text. Its strength lies in connecting LLM behavior to decades of research on confidence, belief updating, and anthropomorphism. At the same time, it is primarily analytical rather than focused on real-world agent workflows. Drawing on established psychology, it explains why the effect should exist more than it measures how often the effect shows up in daily tool use.
The International Journal of Human-Computer Studies article offers experimental HCI evidence. By manipulating how an LLM expresses uncertainty and observing subsequent user choices, it provides the closest thing to a causal test of the slot-machine dynamic. The finding that hedged language reduces trust yet increases retries is particularly relevant for interface designers. However, like most lab experiments, it relies on controlled tasks and time-limited interactions. Translating those findings to complex professional settings (where users may cross-check with other tools, collaborate with colleagues, or be constrained by deadlines) requires caution.
The arXiv preprint on uncertainty propagation is a technical contribution that has not yet undergone peer review. It is valuable chiefly as a signal that engineers recognize the instability of multi-step agents as a first-class problem. The proposed framework for tracking and surfacing uncertainty across an agent’s planning and execution steps suggests concrete ways future systems might mitigate the slot-machine dynamic, for example by warning users when a run is likely to be fragile. But until such approaches are validated at scale and adopted in production, they remain promising directions rather than established solutions.
Across these sources, the evidence is strongest on two points. First, LLM outputs are inherently probabilistic and often presented in ways that encourage over-interpretation, especially when answers are fluent and confident-sounding. Second, when users encounter variable answers and verbalized uncertainty, many respond by retrying more, not less, which can entrench a cycle of experimentation that feels productive but may not reliably improve outcomes.
Implications for power users
For people who lean heavily on AI agents, the research suggests a practical reframe. Instead of treating each new run as another chance to “win” a perfect answer, it may be safer to think of the agent as a sampler from a distribution of plausible responses. In that framing, the goal is not to keep pulling the lever until one answer feels satisfying, but to systematically compare a small number of diverse outputs, cross-check them against external sources, and then decide.
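One concrete way to act on that reframe is to decide on a small sampling budget in advance, gather the responses, and compare them explicitly instead of retrying until something feels right. The sketch below is a hypothetical illustration: ask_agent is a stand-in for whatever interface a particular agent exposes, and the exact-match comparison is a crude proxy for the semantic checking a real workflow would need.

```python
# Hypothetical "sample, compare, decide" workflow instead of open-ended retrying.
# ask_agent is a placeholder for whichever agent API is actually in use.
from collections import Counter

def sample_and_compare(prompt, ask_agent, n=3):
    """Collect a fixed budget of answers, then surface how much they agree."""
    answers = [ask_agent(prompt) for _ in range(n)]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / n  # 1.0 means all samples matched exactly
    # Low agreement is a cue to cross-check externally, not to keep re-rolling.
    return {"answers": answers, "consensus": top_answer, "agreement": agreement}
```

The specific budget and comparison method matter less than the shift in posture: a fixed number of samples and an explicit agreement check turn open-ended lever-pulling into a deliberate evaluation step.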
Designers and organizations deploying agents into critical workflows may also need to revisit interface choices. Clearer signaling of uncertainty, limits on unstructured retries, and better tools for side-by-side comparison of runs could help shift behavior away from slot-machine dynamics and toward deliberate evaluation. Until engineering solutions for uncertainty propagation mature and are widely adopted, the burden of resisting the pull of variable rewards will fall largely on users and the institutions that set norms for how these systems are used.
*This article was researched with the help of AI, with human editors creating the final content.