A growing body of research now quantifies what many chatbot users have suspected for months: the longer a conversation runs, the worse the AI performs. Large-scale simulations comparing single-turn and multi-turn exchanges across six types of generation tasks found an average 39% performance drop once models had to sustain a dialogue, with failures traced to early assumptions, premature solution attempts, and over-reliance on recent inputs. The findings carry real stakes as chatbots expand into medicine, mental health counseling, and customer service, where a wrong answer on turn 12 can matter far more than a wrong answer on turn one.
A 39% Drop When the Conversation Keeps Going
The clearest evidence comes from a paper titled “LLMs Get Lost In Multi-Turn Conversation,” which ran large-scale simulations pitting single-exchange prompts against extended dialogues. Across six generation task types, models showed an average 39% accuracy decline in multi-turn settings. The failure modes were specific and repeatable: models locked onto assumptions made in early turns, attempted solutions before gathering enough information, and leaned too heavily on the most recent user message while discounting earlier context. This is not a matter of running out of memory or hitting a token ceiling. The models had the technical capacity to process the full conversation. They simply stopped using it well.
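To make that comparison concrete, here is a minimal Python sketch of the kind of setup the paper describes: the same task posed once as a fully specified prompt, and again with its requirements revealed turn by turn. The task text is an invented example, and call_model is a hypothetical stand-in for whatever chat API is being tested, not the researchers’ actual harness.

```python
# Minimal sketch of a single-turn vs. multi-turn comparison.
# call_model is a hypothetical stand-in for a chat-completion API:
# it takes a list of {"role", "content"} messages and returns a string.

def call_model(messages: list[dict]) -> str:
    raise NotImplementedError("plug in the chat API you want to test")

# The same task, fully specified up front...
single_turn = [{
    "role": "user",
    "content": (
        "Write a Python function that parses ISO-8601 dates, "
        "returns None on invalid input, and includes type hints."
    ),
}]

# ...and the same requirements revealed one turn at a time,
# the way real users tend to deliver them.
turns = [
    "Write a Python function that parses dates.",
    "Actually, the dates are in ISO-8601 format.",
    "It should return None instead of raising on invalid input.",
    "Oh, and please add type hints.",
]

def run_multi_turn(turns: list[str]) -> str:
    messages = []
    answer = ""
    for turn in turns:
        messages.append({"role": "user", "content": turn})
        answer = call_model(messages)
        messages.append({"role": "assistant", "content": answer})
    return answer  # only the final attempt gets scored

# Scoring both outputs with the same checker is what exposes the gap:
# the multi-turn answer tends to lock in early assumptions and miss
# constraints that arrive in later turns.
```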
Separate benchmark testing reinforced this pattern. Research published under the Multi-IF framework measured how well models follow instructions as a conversation progresses, testing performance across turns and across languages. Accuracy fell measurably from the first turn onward, and the decline deepened as dialogues lengthened. The degradation was not limited to English; instruction-following weakened in multilingual settings too, suggesting the problem is structural rather than language-specific and that current training methods do not reliably teach models to maintain stable goals over time.
Why Models Forget What You Said Five Minutes Ago
The technical roots of this problem trace back to a well-documented phenomenon in how language models process long text. A foundational study on long-context behavior showed that performance degrades when relevant information sits in the middle of a long input, rather than near the beginning or end. Performance follows a U-shaped curve: models lean on tokens near the start and end of the input while undervaluing everything in between. In a conversation that runs 10 or 15 turns, the user’s most important clarification or constraint often lands right in that neglected middle zone, where the model’s internal weighting effectively treats it as background noise.
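A simple way to see this effect is a position probe: bury one relevant fact at different depths inside filler text and check how often the model retrieves it. The sketch below is a hypothetical illustration of that idea, not the study’s actual protocol; ask_model, the fact, the question, and the filler are all invented placeholders.

```python
# Sketch of a "lost in the middle" probe: bury one relevant fact at
# different depths inside filler text and check whether the model can
# still retrieve it. ask_model is a hypothetical stand-in for an API
# call, and the fact, question, and filler are invented placeholders.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in the model you want to probe")

FACT = "The project codename is BLUE HERON."
QUESTION = "What is the project codename? Answer with the codename only."
FILLER = "This sentence is unrelated background material. " * 40

def build_prompt(depth: float, n_chunks: int = 20) -> str:
    """Place FACT at a relative depth (0.0 = start, 1.0 = end)."""
    chunks = [FILLER] * n_chunks
    chunks.insert(int(depth * n_chunks), FACT)
    return "\n".join(chunks) + "\n\n" + QUESTION

def probe(depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials: int = 10) -> dict:
    results = {}
    for depth in depths:
        hits = sum(
            "BLUE HERON" in ask_model(build_prompt(depth)).upper()
            for _ in range(trials)
        )
        results[depth] = hits / trials
    return results

# In evaluations of this kind, retrieval tends to be strongest when the
# fact sits near depth 0.0 or 1.0 and weakest in the middle, tracing
# out the U-shaped curve described above.
```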
A more recent paper frames this as “Lost in Conversation,” or LiC, and argues the core issue is not simply a capability gap but an intent-alignment problem. As a dialogue unfolds, the model’s internal representation of what the user wants drifts further from the user’s actual goal. Each new turn introduces small mismatches between the model’s assumptions and the human’s intent, and those mismatches compound. The researchers propose architectural changes and alignment strategies designed to periodically re-anchor the model’s understanding, such as explicit intent summaries or specialized modules that track user goals across turns, though none have been widely adopted in commercial products yet.
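The sketch below illustrates how such a re-anchoring step might look in practice, under the assumption of a generic chat API: every few turns, the system asks the model to restate the user’s goal and pins that summary back into the conversation. The call_model function, the prompts, and the anchoring interval are all hypothetical, not a mechanism taken from the paper.

```python
# Sketch of one way an explicit intent summary could re-anchor a chat.
# call_model is a hypothetical stand-in for a chat-completion API.

def call_model(messages: list[dict]) -> str:
    raise NotImplementedError("plug in a chat API")

def summarize_intent(history: list[dict]) -> str:
    """Ask the model to restate the user's current goal and constraints."""
    prompt = history + [{
        "role": "user",
        "content": "In two sentences, restate my goal and every constraint so far.",
    }]
    return call_model(prompt)

def respond(history: list[dict], turn_index: int, anchor_every: int = 3) -> str:
    messages = list(history)
    if turn_index > 0 and turn_index % anchor_every == 0:
        # Inject the summary as a system-style reminder so later turns
        # are answered against an explicit statement of intent instead
        # of whatever has drifted into the recent context.
        messages.append({
            "role": "system",
            "content": "Current user intent: " + summarize_intent(history),
        })
    return call_model(messages)
```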
Even brief pushback mid-conversation can destabilize a model’s reasoning. In a controlled two-round experiment, researchers simply asked models “Are you sure?” after an initial answer. The models frequently flipped their responses, and overall accuracy deteriorated from the initial answer to the final one. That finding points to a deeper fragility: models treat conversational pressure as a signal to change course, even when their first answer was correct. In practical terms, users who instinctively challenge or double-check an AI’s output may unwittingly push it toward worse performance rather than better, especially if the system lacks explicit mechanisms for confidence estimation or self-consistency checks.
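The setup is simple enough to sketch: one question, one answer, one pushback, and a comparison of accuracy before and after. The harness below is an illustrative version of that two-round protocol, with call_model as a hypothetical stand-in for the API under test and substring matching as a crude placeholder for the study’s actual grading.

```python
# Illustrative harness for the two-round "Are you sure?" test.
# call_model is a hypothetical stand-in for a chat-completion API,
# and substring matching is a crude placeholder for real grading.

def call_model(messages: list[dict]) -> str:
    raise NotImplementedError("plug in a chat API")

def two_round_trial(question: str, correct_answer: str) -> dict:
    messages = [{"role": "user", "content": question}]
    first = call_model(messages)
    messages += [
        {"role": "assistant", "content": first},
        {"role": "user", "content": "Are you sure?"},
    ]
    second = call_model(messages)
    first_ok = correct_answer.lower() in first.lower()
    second_ok = correct_answer.lower() in second.lower()
    return {
        "first_correct": first_ok,
        "second_correct": second_ok,
        "flipped": first_ok != second_ok,
    }

# Aggregating these flags over a question set reproduces the qualitative
# finding: accuracy after the pushback turn is lower than before it,
# and many of the flips abandon answers that were already correct.
```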
Long Context Does Not Mean Better Context
One common industry response to these problems has been to expand context windows, the amount of text a model can process at once. But longer windows do not automatically translate into better performance. A study that tested long-context models on in-context learning tasks selected six datasets whose label spaces ranged from 28 to 174 classes and varied how many few-shot demonstrations were packed into each prompt. The results showed that models still struggled to use the additional context effectively, particularly when the number of classes and examples grew large. Bigger windows gave models more information to work with, but the same attention biases meant they often ignored the most relevant parts of it and failed to generalize reliably from the scattered examples.
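To give a sense of what that kind of test looks like, the sketch below assembles a many-class few-shot prompt in the general style of such benchmarks; the labels, example texts, and counts are invented placeholders rather than the benchmark’s actual data.

```python
# Sketch of a many-class in-context-learning prompt of the general kind
# such benchmarks use: labeled demonstrations drawn from a large label
# space, followed by one unlabeled query. Labels, texts, and counts are
# invented placeholders, not the benchmark's data.

import random

LABELS = [f"class_{i:03d}" for i in range(174)]  # large label space

def make_demonstrations(n: int) -> list[tuple[str, str]]:
    return [(f"example text {i}", random.choice(LABELS)) for i in range(n)]

def build_icl_prompt(demos: list[tuple[str, str]], query: str) -> str:
    lines = [f"Text: {text}\nLabel: {label}\n" for text, label in demos]
    lines.append(f"Text: {query}\nLabel:")
    return "\n".join(lines)

prompt = build_icl_prompt(make_demonstrations(348), "query text to classify")
# With hundreds of demonstrations and a 174-way label space, the prompt
# easily fills a long context window, which is exactly the regime where
# the tested models struggled to use the extra context reliably.
```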
This distinction matters because AI companies have marketed expanded context windows as a selling point, implying that users can have longer, richer conversations without losing quality. The research tells a different story. More context creates more opportunities for the model to lose track of what matters, especially when the conversation involves back-and-forth clarification rather than a single, well-structured prompt. For users who rely on chatbots for complex tasks like drafting legal documents, debugging code, or planning travel itineraries, the practical advice is counterintuitive: shorter, more focused exchanges tend to produce better results than marathon sessions. Chunking a task into discrete, well-labeled prompts, and periodically restating key constraints, can help counteract the model’s tendency to drift.
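As a rough illustration of that advice, the sketch below splits one job into self-contained prompts that each restate the full constraint list; the constraints and sub-tasks are invented examples, not a prescribed workflow.

```python
# Illustration of breaking one job into focused prompts that each
# restate the key constraints, instead of letting those constraints
# sink into the middle of a long thread. The constraints and sub-tasks
# are invented examples.

CONSTRAINTS = [
    "Output must be valid JSON.",
    "Dates use ISO-8601 format.",
    "Do not invent fields that are not in the schema.",
]

SUBTASKS = [
    "Draft the schema for the customer record.",
    "Write the parser for the legacy CSV export.",
    "Write unit tests for the parser.",
]

def focused_prompt(subtask: str) -> str:
    """Build a self-contained prompt that carries the full constraint
    list, so nothing has to be remembered from ten turns ago."""
    header = "Constraints (apply to everything below):\n"
    header += "\n".join(f"- {c}" for c in CONSTRAINTS)
    return f"{header}\n\nTask: {subtask}"

prompts = [focused_prompt(s) for s in SUBTASKS]
# Each prompt can be sent in a fresh, short conversation; the user,
# not a drifting model, stitches the results together.
```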
Real-World Failures in Medicine and Mental Health
The consequences of multi-turn degradation become sharper in high-stakes domains. A report from the University of Nebraska Medical Center found that AI chatbots falter 56% of the time when handling real-world medical questions from actual patients, despite performing well on textbook-style exam prompts. The gap between controlled testing and messy, multi-turn patient interactions is exactly the kind of degradation the research predicts. A patient who asks a follow-up question, corrects a misunderstanding, or adds new symptoms mid-conversation is effectively pushing the model into the zone where it performs worst, raising the risk of incomplete or misleading advice.
Mental health applications face similar risks, but with added ethical weight. Ellie Pavlick, a Brown University computer scientist who studies language models, has warned that using chatbots in therapy-like settings demands far more rigorous oversight and a clear understanding of their limits. In a university discussion of mental health ethics, she emphasized that models are not designed to reliably track a person’s evolving emotional state or history across long conversations. Multi-turn drift means a chatbot might respond appropriately to an initial disclosure of distress, then gradually normalize or even encourage risky behavior as the dialogue continues and earlier red flags fade into the middle of its context. For vulnerable users, that kind of inconsistency can be more dangerous than a single, obviously wrong answer.
Designing Conversations Around Known Weaknesses
The emerging research does not suggest that multi-turn conversations are impossible, but it does imply that current systems need guardrails tailored to their weaknesses. Product designers can mitigate degradation by structuring interfaces that encourage users to restate goals, confirm constraints, and periodically summarize progress. Simple design patterns, such as automatic recap messages that list agreed-upon requirements, or prompts that ask users to verify the model’s understanding before proceeding, directly counter the documented drift in intent alignment. On the model side, training regimes that explicitly reward consistency across turns, rather than just single-turn correctness, could help close the gap between benchmark scores and lived user experience.
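A minimal sketch of that recap pattern, assuming a generic chat API behind a hypothetical call_model function, might look like this:

```python
# Sketch of an automatic recap pattern: every few turns, the interface
# asks the model to list the agreed-upon requirements and shows them to
# the user for confirmation before proceeding. call_model is a
# hypothetical stand-in for a chat-completion API.

def call_model(messages: list[dict]) -> str:
    raise NotImplementedError("plug in a chat API")

def recap_requirements(history: list[dict]) -> str:
    prompt = history + [{
        "role": "user",
        "content": "List every requirement we have agreed on so far, as bullet points.",
    }]
    return call_model(prompt)

def maybe_recap(history: list[dict], turn: int, every: int = 4) -> str | None:
    """Return a recap message for the UI to display, or None if one is not due."""
    if turn > 0 and turn % every == 0:
        recap = recap_requirements(history)
        return (
            "Before we continue, here is what I have so far:\n"
            f"{recap}\n"
            "Reply 'confirmed' or correct anything that is wrong."
        )
    return None

# A confirmed recap can then be pinned into the conversation, for
# example as a system message, so later turns are answered against an
# explicit, user-approved statement of intent rather than a drifting one.
```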
For now, the safest approach is to treat chatbots as powerful but brittle tools whose reliability declines as conversations grow longer and more complex. Users can protect themselves by breaking big problems into smaller segments, checking critical facts against independent sources, and ending threads that seem to have gone off course instead of trying to “rescue” them with more prompts. Regulators and professional bodies in medicine, mental health, and law will likely need to incorporate these findings into guidelines that limit unsupervised, high-stakes use of AI assistants. Until models are explicitly engineered and tested for multi-turn robustness, the evidence suggests that the smartest way to talk to them is also the simplest: one careful question at a time.
This article was researched with the help of AI, with human editors creating the final content.