Morning Overview

ChatGPT can speed learning but may weaken long-term retention

A growing body of experimental research suggests that using ChatGPT can boost short-term performance on some tasks, but may come with weaker retention weeks later. In one randomized controlled trial of 120 undergraduates, unrestricted ChatGPT users scored 57.5% on a surprise retention test 45 days after studying, while students who worked without the tool scored 68.5%. That 11-percentage-point gap raises a pointed question for educators and learners alike: does easy access to AI answers reduce the mental effort that helps cement knowledge?

The 45-Day Retention Gap

The clearest evidence comes from a randomized trial published in Social Sciences and Humanities Open. Researchers assigned 120 undergraduates to either unrestricted ChatGPT use or no-AI study conditions, then administered a surprise test 45 days later. ChatGPT users averaged 57.5% correct answers; non-users averaged 68.5%, a difference the authors reported as statistically significant. They describe unrestricted AI use as a “cognitive crutch” that reduces the effortful processing required to move information from short-term to long-term memory.

A second experiment reinforces the pattern from a different angle. In a preprint, researchers studied 123 students across four conditions: ChatGPT, Google search, an e-textbook, and a no-tool control. Tasks were mapped to Bloom’s Taxonomy, ranging from basic recall to higher-order analysis. The authors report that both ChatGPT and Google improved immediate performance on lower-order tasks, but that the advantage shrank on later follow-up testing. The implication is consistent: tools that supply quick answers help students complete assignments but do not necessarily help them learn the material in a durable way.

Structured AI Tutoring Tells a Different Story

The retention penalty appears tied to how students use AI, not to AI itself. When ChatGPT is constrained to a tutoring role rather than an answer machine, the results look different. A study of 274 participants in PLOS ONE compared ChatGPT-generated math hints against human-authored hints and a control group across multiple topics. The study reported statistically significant learning gains in certain conditions, with ChatGPT-generated hints performing on par with hints written by human tutors. The key distinction is that hints guide students toward answers rather than handing them over, preserving the cognitive work that builds retention.

Research published in npj Science of Learning pushes this further by measuring what happens inside the brain. In a biology learning environment, students were randomly assigned to receive different types of chatbot feedback: metacognitive prompts, affective encouragement, or neutral responses. Using functional near-infrared spectroscopy (fNIRS), researchers tracked brain activity during the sessions and then tested retention and transfer. The study found meaningful differences in both learning outcomes and neural engagement depending on feedback type. Metacognitive prompts, which push students to reflect on their own thinking, appear to activate deeper processing than affective encouragement or neutral responses alone.

Brain Scans and Exam Scores Point to “Cognitive Debt”

The neurological dimension of this problem extends beyond classroom chatbots. A preprint affiliated with MIT Media Lab used EEG to monitor brain activity while participants wrote essays under three conditions: with a large language model, with a search engine, and with no digital assistance at all. The study ran multiple sessions over several months and included a switch condition where some participants moved from LLM-assisted writing to unassisted writing and vice versa. Researchers reported lower brain connectivity during AI-assisted sessions, a finding they describe as accumulating “cognitive debt.” In the switch condition, the authors interpret the post-switch patterns as consistent with reduced independent engagement after relying on the LLM.

Academic performance data from real classrooms aligns with these laboratory findings. An analysis of student essays used generative AI detection tools to classify likely users, then compared their exam outcomes using multivariate regression. Students classified by the paper’s detection approach as probable generative AI users scored roughly 6.71 points lower out of 100 on average than non-users. This is not a randomized experiment in the same way as the 45-day retention trial, but it captures a real-world signal: students who offload substantial portions of their writing to AI tools tend to perform worse when tested without those tools.

What the Broader Evidence Shows

A recent synthesis in Humanities and Social Sciences Communications pooled experimental and quasi-experimental research on ChatGPT in education. The meta-analysis quantified effects on learning performance, student perceptions, and higher-order thinking skills. Results were generally positive for basic learning performance but mixed for higher-order thinking, with significant variation depending on moderators such as grade level and course type. That pattern fits the emerging consensus: AI tools can help students absorb facts and complete tasks, but the gains do not automatically extend to the analytical and creative thinking that defines deep learning.

One gap in the current evidence deserves attention. Nearly all of the controlled studies involve college-aged participants in STEM or social science courses. There is limited primary data on how younger students, vocational learners, or those in humanities-heavy programs respond to AI-assisted study. The longest follow-up window in the controlled literature is 45 days, which means the field still lacks evidence on whether retention deficits compound over a full semester or academic year. These are not reasons to dismiss the findings, but they define the boundaries of what researchers can confidently claim right now and highlight where future work is most needed.

Practical Strategies to Protect Retention

The research points toward a clear design principle: AI tools should prompt students to think, not simply deliver polished answers. In practice, that means configuring systems so that the default behavior is to ask questions, request intermediate steps, and encourage self-explanation. For example, rather than pasting a problem into ChatGPT and asking for a solution, students can request a sequence of guiding questions or hints, mirroring the scaffolding used by human tutors in the math-hints experiments.
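
To make the idea concrete, the sketch below shows one way a course tool or a student could set that hint-first default through a system prompt. It is a minimal illustration, not the setup used in any of the studies above: the model name, prompt wording, and helper function are assumptions that would need tuning for a real course, and it relies on the OpenAI Python SDK with an API key configured in the environment.

```python
# Minimal sketch (not from the studies cited above) of one way to make
# "guiding questions first, never full solutions" the default behavior.
# The model name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TUTOR_SYSTEM_PROMPT = (
    "You are a tutor. Never give the final answer or a complete solution. "
    "Ask one guiding question at a time, request the student's intermediate "
    "steps, and have them explain their reasoning before the next hint."
)

def ask_for_hint(problem: str, student_work: str) -> str:
    """Return a single guiding hint for the student's current attempt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any chat-capable model works
        messages=[
            {"role": "system", "content": TUTOR_SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Problem: {problem}\n\nMy work so far: {student_work}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask_for_hint("Solve 3x + 7 = 19 for x.",
                       "I subtracted 7 and got 3x = 12."))
```

The design choice matters more than the specific vendor or model: whatever tool is used, the default should return a question or a partial step, so the student still does the retrieval and reasoning that the retention studies suggest is being lost.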

Educators can reinforce this by building AI use into course policies and assignments instead of trying to ban it outright. In problem sets, instructors might require students to submit both their initial reasoning and any AI-generated hints they used, along with a short reflection on what changed in their understanding. This aligns with the metacognitive prompts that boosted neural engagement in the fNIRS study and helps ensure that AI interaction becomes part of the learning process rather than a shortcut around it.

Assessment design also matters. When high-stakes exams are conducted without AI, but most practice work is done with unrestricted tools, students can drift into the kind of cognitive debt seen in the EEG research. Blending closed-book quizzes, oral explanations, and in-class problem solving with AI-augmented homework creates a more consistent cognitive environment. The goal is to keep students regularly practicing retrieval and reasoning without automated support so that those skills remain fluent when it counts.

Institutions and researchers have a role to play in shaping this next phase. Preprint repositories such as arXiv help disseminate early findings on AI and learning, including work that has not yet cleared peer review. As more long-term, diverse, and classroom-embedded studies appear, educators will be better positioned to calibrate when AI should act as a calculator, when it should act as a coach, and when it should be switched off entirely.

The emerging message is neither alarmist nor complacent. Unrestricted answer-giving appears to carry a measurable retention cost, especially over weeks-long timescales. Yet carefully structured AI tutoring, built around hints and metacognitive prompts, can match human guidance and even deepen engagement. The challenge for schools and students is to move beyond the novelty of instant answers and instead design AI use around the slow, effortful work that real learning still requires.

*This article was researched with the help of AI, with human editors creating the final content.