Even Microsoft admits AI chatbots get dumber the longer you talk

Microsoft Research has produced a peer-reviewed study showing that the leading AI chatbots lose significant accuracy as conversations grow longer, confirming a frustration familiar to anyone who has tried to hold a detailed, multi-step dialogue with tools like ChatGPT or Copilot. The study documents an average 39% performance drop when top large language models handle multi-turn chats instead of single, fully specified prompts. That finding lands at a moment when businesses and consumers are relying on these systems for everything from customer service to software development, raising hard questions about whether the technology can be trusted for sustained, complex tasks.

A 39% Drop That Benchmarks Cannot Hide

The research, which is listed for presentation at ICLR 2026, tested both open-weight and closed-weight large language models on tasks that required back-and-forth dialogue rather than a single, neatly packaged question. When models received a complete prompt in one shot, they performed well. But when the same information was spread across multiple conversational turns, accuracy fell by an average of 39%. The gap held across a range of model families, meaning this is not a quirk of one vendor’s system but a structural weakness in how current LLMs process extended interactions.

That number matters because it captures something users have reported anecdotally for months: chatbots seem sharp at the start of a session and progressively less reliable as the dialogue continues. The study moves that impression from speculation to measured reality. By comparing single-turn and multi-turn conditions on the same underlying tasks, the researchers isolated the conversation itself as the variable driving the decline, not the difficulty of the questions. In other words, simply reformatting a problem as a dialogue (rather than a single, fully specified prompt) was enough to erase nearly four out of every ten correct answers, even when the model had already demonstrated that it could solve the problem in principle.
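To make that comparison concrete, the sketch below shows roughly what the two conditions look like in code: the same task sent once as a fully specified prompt, and once with its requirements revealed over several turns. This is a minimal illustration assuming the OpenAI Python SDK, a placeholder model name, and a made-up task; the study’s actual prompts, models, and scoring harness are not reproduced here.

```python
from openai import OpenAI

client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"      # placeholder model name

# The same task, once as a single fully specified prompt...
single_turn = [{
    "role": "user",
    "content": (
        "Write a Python function that returns the median of a list of numbers. "
        "It must raise ValueError on an empty list and must not sort the list in place."
    ),
}]

# ...and once with the same requirements revealed turn by turn.
multi_turn_shards = [
    "Write a Python function that returns the median of a list of numbers.",
    "It should raise ValueError if the list is empty.",
    "One more constraint: it must not sort the input list in place.",
]

def run_single_turn() -> str:
    resp = client.chat.completions.create(model=MODEL, messages=single_turn)
    return resp.choices[0].message.content

def run_multi_turn() -> str:
    messages, answer = [], ""
    for shard in multi_turn_shards:
        messages.append({"role": "user", "content": shard})
        resp = client.chat.completions.create(model=MODEL, messages=messages)
        answer = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
    # Only the final answer is judged against the full specification.
    return answer
```

In the multi-turn case the model has seen every requirement, just not all at once, and only its final answer is judged against the full specification. That is the gap the researchers measured.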

Why Enterprise Tool Use Makes It Worse

A separate line of research examined what happens when LLMs are asked to call external tools and functions during long conversations, the kind of workflow that powers enterprise assistants managing calendars, databases, and internal APIs. That work, known as LongFuncEval, found that performance drops as multi-turn conversations grow longer and as the number of available tools increases. Longer tool outputs compound the problem further, meaning that the more useful context a model receives from an external system, the less reliably it can act on that context.

This is a direct concern for companies building AI-powered workflows. An assistant that books a meeting correctly on the first request but garbles a follow-up calendar change three turns later is worse than no assistant at all, because the user may not catch the error. The LongFuncEval findings suggest that degradation in tool-calling settings is not simply a side effect of longer text but a distinct failure mode tied to how models juggle structured function calls alongside natural language. As the number of tools and the complexity of their outputs grow, the model must track both conversational history and a web of intermediate results, and the research indicates that today’s systems routinely lose track of critical details along the way.
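The sketch below illustrates the shape of that workflow and why it stresses the model: every assistant turn and every tool result is appended to the running transcript that the model must re-read on the next call. The tool schema, dispatch logic, and model name here are hypothetical placeholders, not LongFuncEval’s actual setup.

```python
import json
from openai import OpenAI

client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"      # placeholder model name

# A single hypothetical tool; real enterprise assistants expose many more,
# which is one of the factors LongFuncEval ties to degradation.
tools = [{
    "type": "function",
    "function": {
        "name": "search_calendar",
        "description": "Return calendar events matching a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def search_calendar(query: str) -> list[dict]:
    # Stand-in for a real backend; in practice results can be long, and long
    # tool outputs are exactly what compounds the problem.
    return [{"title": "Quarterly review", "time": "2026-06-12T10:00", "query": query}]

def handle_user_turn(messages: list, user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    while True:
        resp = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
        msg = resp.choices[0].message
        messages.append(msg)            # the assistant turn stays in the history
        if not msg.tool_calls:
            return msg.content          # no more tool work: answer the user
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = search_calendar(**args) if call.function.name == "search_calendar" else []
            # Tool output is appended verbatim, so the transcript keeps growing
            # with every turn, every tool call, and every result.
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
```

Nothing in this loop prunes or prioritizes old context, and that unbounded accumulation is the regime in which the reported degradation shows up.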

Benchmarks That Shaped the Conversation

Part of the reason the multi-turn problem went underappreciated for so long is that standard evaluation benchmarks were designed around single-turn prompts. One widely used dataset, GSM8K, consists of 8.5K grade-school math word problems originally built to test multi-step reasoning. According to its authors, it was created to provide a reliable way to evaluate reasoning, and it has since become a reference point across many LLM assessments. But benchmarks like GSM8K were not designed to capture what happens when a user asks a follow-up question, corrects a mistake, or adds new constraints mid-conversation.

The Microsoft Research study and the LongFuncEval work both highlight this gap. Models that score well on single-turn math or logic tests can still fall apart when the same material is delivered through a realistic dialogue. That disconnect means leaderboard rankings, which companies and developers use to choose models, may overstate real-world reliability. A model that tops a single-turn benchmark could be significantly less capable in the conversational setting where most users actually interact with it. Until evaluation suites incorporate multi-turn and tool-using scenarios as first-class tasks, organizations risk making high-stakes deployment decisions based on numbers that only apply in idealized, one-shot conditions.

What a Context Refresh Could Change

One possible mitigation that researchers and engineers have discussed is a periodic “context refresh,” where the model pauses mid-conversation to summarize what has been established so far and then continues from that compressed state rather than dragging the full, growing transcript forward. The logic is straightforward: if degradation scales with context length and turn count, resetting the effective context should slow the decline. In principle, such a mechanism could also prune away off-topic digressions and keep the model focused on task-relevant information, improving both accuracy and latency by reducing the amount of text the system must process on every turn.
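As a rough illustration of the idea, the sketch below replaces the accumulated transcript with a model-generated summary every few turns and continues from that compressed state. The threshold, prompt wording, and model name are arbitrary assumptions, and this is not a mechanism taken from the cited studies.

```python
from openai import OpenAI

client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"      # placeholder model name
REFRESH_EVERY = 6          # arbitrary refresh threshold, in user turns

def refresh_context(messages: list[dict]) -> list[dict]:
    """Compress the conversation so far into a single summary message."""
    summary_request = messages + [{
        "role": "user",
        "content": (
            "Summarize everything established in this conversation so far: the "
            "user's goal, every constraint and correction, and any intermediate "
            "results. Be exhaustive about task-relevant facts."
        ),
    }]
    resp = client.chat.completions.create(model=MODEL, messages=summary_request)
    summary = resp.choices[0].message.content
    # Anything the summary drops is invisible to every later turn.
    return [{"role": "system",
             "content": f"Summary of the conversation so far:\n{summary}"}]

def chat_turn(messages: list[dict], user_text: str, turn_count: int) -> tuple[list[dict], str]:
    if turn_count and turn_count % REFRESH_EVERY == 0:
        messages = refresh_context(messages)
    messages.append({"role": "user", "content": user_text})
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    answer = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    return messages, answer
```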

The challenge is that any summarization step introduces its own risk of information loss. If the model drops a detail during the refresh, later turns will be working from an incomplete picture. Designing a refresh mechanism that preserves all task-relevant facts while trimming conversational noise is itself a hard problem, and it is not clear that current architectures can do it reliably. Neither of the studies discussed here measures how much a context refresh would actually help in multi-turn or function-calling settings, so the idea remains a hypothesis rather than a proven fix. Still, the scale of the documented degradation suggests that even imperfect mitigations, such as selective summarization of tool outputs or periodic restatement of user goals, could deliver meaningful gains in practice.

What This Means for Anyone Using AI Daily

For individual users, the practical takeaway is blunt: the longer a chat session runs, the less you should trust the output without independent verification. Starting a fresh conversation for each distinct task is a simple workaround that sidesteps much of the documented decline. When a topic evolves substantially (say, from outlining a document to debugging code), it can be safer to open a new thread and restate the core requirements rather than relying on the model to correctly interpret a long trail of prior messages. Power users who rely on extended sessions for coding, research, or data analysis should treat late-session answers with extra skepticism, especially when the model is pulling from earlier parts of the conversation to inform its response.

For businesses deploying AI assistants at scale, the implications are more serious. Customer service bots, internal knowledge tools, and automated workflow agents all depend on sustained accuracy across many turns. A 39% average performance drop means that enterprise deployments built on optimistic single-turn benchmarks may be systematically overpromising what their AI can deliver. In customer-facing settings, that gap can translate into wrong refunds, misrouted support tickets, or non-compliant responses that only surface after the fact. Internally, it can mean brittle automation that appears to work in testing but fails quietly once employees start issuing follow-up instructions and mixing natural language with structured tool calls.

The broader lesson is that the gap between demo performance and real-world reliability remains wide. AI companies have spent the past two years racing to improve headline benchmark scores, but the Microsoft Research findings show that the conditions under which those scores are earned bear little resemblance to how people actually use chatbots. Closing that gap will require not just better models but better ways of measuring what “better” means when the conversation does not end after one question. Until then, both everyday users and enterprises will need to treat long, complex chats not as a strength of current systems, but as one of their most significant points of failure, and design their habits, workflows, and safeguards accordingly.

*This article was researched with the help of AI, with human editors creating the final content.