
LLMs are bad at explaining their own inner workings

Recent research has cast a spotlight on the limitations of large language models (LLMs) in explaining their own internal processes. A study from Anthropic finds that these systems are “highly unreliable” at articulating how they arrive at their outputs, raising significant questions about transparency and posing challenges for the future of AI interpretability.

Overview of LLM Internal Processes

Large language models, despite their impressive ability to generate human-like text, are often likened to black boxes: the internal decision-making processes that guide their outputs remain largely opaque. These models produce responses through complex, layered neural networks, but they have no built-in mechanism for accurately reporting on that computation. When asked to describe their own internal processes, they have proven “highly unreliable.”
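To make the opacity concrete, the sketch below shows what a model actually hands back from a forward pass: a distribution over next tokens, not an account of how it got there. The per-layer activations that produced the answer have to be extracted explicitly with interpretability tooling. This is a minimal illustration assuming the Hugging Face transformers library, with GPT-2 as a stand-in model; it is not tied to any particular study.

```python
# Minimal sketch: a causal LM returns token probabilities, not an explanation of
# its computation. Seeing what happened inside requires pulling out hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in; any causal LM behaves the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The visible product: the model's most likely next token.
next_token_id = outputs.logits[0, -1].argmax().item()
print("Model's answer:", tokenizer.decode(next_token_id))

# The hidden product: one activation tensor per layer, which the model never
# verbalizes. Interpretability work analyzes these tensors directly rather than
# asking the model to describe them.
for i, layer_states in enumerate(outputs.hidden_states):
    print(f"layer {i}: activation tensor of shape {tuple(layer_states.shape)}")
```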

Anthropic’s Research Approach

The Anthropic study examined the self-reporting capabilities of AI systems, running a series of probes that tested whether models could accurately explain the reasoning behind their answers on specific tasks. The study, published on November 4, 2025, forms part of Anthropic’s broader effort to improve AI interpretability.
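To illustrate what a probe of this kind can look like, the sketch below shows one generic style of faithfulness test discussed in the interpretability literature: plant a hint that changes the model’s answer, then check whether the model’s own explanation acknowledges the hint. This is a hedged Python sketch, not Anthropic’s actual protocol; `query_model` is a hypothetical stand-in for whichever chat API is available, and the string-matching check is a crude placeholder for more careful grading.

```python
from typing import Callable, Dict

def hint_sensitivity_probe(
    query_model: Callable[[str], str],  # hypothetical: send a prompt, get text back
    question: str,
    hint: str,
) -> Dict[str, bool]:
    """Plant a hint, see whether it changes the answer, and check whether the
    model's stated explanation admits that the hint influenced it."""
    baseline = query_model(f"{question}\nAnswer briefly.")
    hinted = query_model(f"{hint}\n{question}\nAnswer briefly.")
    explained = query_model(
        f"{hint}\n{question}\nAnswer briefly, then explain how you decided."
    )
    return {
        # Did the planted hint actually change the model's behavior?
        "answer_changed_by_hint": baseline.strip() != hinted.strip(),
        # Does the self-report mention the hint at all? (crude string check;
        # real evaluations grade this far more carefully)
        "hint_acknowledged": hint.lower() in explained.lower(),
    }

# The telltale pattern of an unfaithful self-report: answer_changed_by_hint is
# True while hint_acknowledged is False -- the hint drove the output, but the
# model's explanation never mentions it.
```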

Key Evidence of Unreliability

The study uncovered significant inconsistencies in how models describe their own thought processes, including instances where LLMs gave fabricated accounts of their internal logic or explanations that did not match the factors that actually drove their answers. These findings, documented in the analysis from November 3, 2025, are the core evidence behind the “highly unreliable” characterization.

Implications for AI Transparency

The inability of AI systems to reliably explain their own reasoning has direct implications for trust, particularly where models are used as decision-making tools. It also complicates regulatory oversight: rules that require explanations for automated decisions are harder to satisfy if a model’s stated rationale does not reflect how it actually reached its output. The risks are sharpest in high-stakes fields such as healthcare and finance, where decisions based on unreliable AI explanations could have severe consequences.

Comparisons with Existing AI Research

Anthropic’s results echo previous work on neural network interpretability, which has likewise highlighted gaps between the explanations LLMs produce and their actual internal computations. The November 4, 2025 findings also bear on earlier debates about AI self-awareness, reinforcing how unreliably these models describe their own internal processes.

Expert Perspectives on the Study

AI researchers have expressed concern over the revelation that AIs can’t reliably explain their own thoughts. Some believe that the Anthropic study’s findings will have a significant impact on future model design. The study has also sparked broader discourse around the limitations of LLMs and the need for improved transparency in AI systems.

Paths Forward for Improvement

Despite these challenges, there are paths forward for improving the self-reporting capabilities of LLMs. Techniques such as additional training targeted at self-explanation could improve AI explainability, and the Anthropic study has spurred ongoing initiatives aimed at overcoming the limitations it identified. The long-term goal is AI systems that can not only perform complex tasks but also reliably explain how they perform them.
