
Smartphone makers keep touting ever-faster neural chips, yet the AI features on those phones still feel oddly modest next to the sprawling chatbots and image generators in the cloud. The gap is not just marketing hype; it reflects a deep mismatch between what today’s neural processing units are built to do and what modern AI models actually demand. To understand why, it helps to follow the money, the physics and the software that sit between those glossy “AI phone” keynotes and the apps people actually use.

NPUs are racing ahead, but TOPS is a blunt instrument

On paper, the progress is staggering: every flagship launch now arrives with a bigger neural number, as if one more spec sheet could finally unlock sci‑fi on your lock screen. The standard yardstick for this race is TOPS, or “trillions of operations per second,” which has become the shorthand for how fast an NPU can crunch matrix math. Yet even the chipmakers concede that judging an NPU in TOPS is an incredibly reductive way to compare processing speeds, because it ignores memory limits, software stacks and the messy reality of real apps.
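
To make that concrete, here is a rough back-of-envelope sketch in Python of why a TOPS figure alone says little about language-model speed. Every number in it is an illustrative assumption rather than a measurement of any particular phone, and the model is a generic batch-of-one decoder.

```python
# Illustrative numbers only; real phones and models vary widely.
npu_tops = 45e12          # advertised peak operations per second (assumed INT8 TOPS)
mem_bandwidth = 50e9      # bytes per second of LPDDR memory bandwidth (assumed)

params = 3e9              # a 3B-parameter on-device language model (assumed)
bytes_per_param = 1       # INT8 weights after quantization

# Generating one token with batch size 1 streams every weight from memory once,
# so decoding is usually limited by bandwidth rather than by arithmetic.
weight_bytes = params * bytes_per_param
ops_per_token = 2 * params            # roughly one multiply-accumulate per weight

tokens_per_s_compute = npu_tops / ops_per_token       # what the TOPS figure implies
tokens_per_s_memory = mem_bandwidth / weight_bytes    # what bandwidth actually allows

print(f"compute-bound ceiling: {tokens_per_s_compute:,.0f} tokens/s")
print(f"memory-bound ceiling:  {tokens_per_s_memory:,.0f} tokens/s")
# With these assumptions the memory ceiling (~17 tokens/s) sits hundreds of times
# below the compute ceiling, which is why the TOPS number alone overpromises.
```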

That disconnect helps explain why a phone that boasts triple‑digit TOPS can still struggle to run a modern language model smoothly. What matters is not just raw arithmetic, but how efficiently the chip moves data, schedules workloads and shares tasks with the CPU and GPU that sit beside it. As ex‑Apple engineers have pointed out, phones like Google’s Pixel and Apple’s iPhone pair a CPU, a GPU and an NPU, and each vendor wires those three together differently. Until software can fully exploit that trio, the headline TOPS figure will keep overpromising on what on‑device AI actually delivers.

Big models, tiny phones: the physics problem

The other half of the story is that the AI people now expect on their phones is built for a very different environment. The most capable systems, from large language models to diffusion image generators, carry massive parameter counts that demand huge compute and memory for both training and inference. As computational linguist Graça Nunes notes, almost all NLP and AI problems can currently be solved with LLMs, but those systems rely on powerful servers and a lot of electricity.

Phones live under the opposite constraints. Mobile AI workloads are highly compute‑intensive and must operate within strict power and thermal limits, because pushing the silicon too hard will result in overheating and rapid battery drain. On top of that, without robust memory and storage, even the simplest AI models can be constrained by latency, power and bandwidth, which is why one chipmaker bluntly calls memory a strategic enabler of mobile AI. The physics of heat, battery and memory mean that no matter how fast the NPU gets, a phone will never behave like a rack of GPUs in a data center.
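
A small sketch of the memory side of that mismatch is below. The model sizes and precisions are illustrative, and the estimate counts only the weights, ignoring activations and the key-value cache that real inference also needs.

```python
# Illustrative sketch: weight memory at different precisions versus phone RAM.
def model_memory_gb(params: float, bytes_per_param: float) -> float:
    """Approximate RAM needed just to hold the weights."""
    return params * bytes_per_param / 1e9

for name, params in [("7B model", 7e9), ("70B model", 70e9)]:
    for precision, nbytes in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
        print(f"{name} at {precision}: ~{model_memory_gb(params, nbytes):.1f} GB")

# A 70B model still needs ~35 GB at INT4, far beyond the 8 to 16 GB of RAM in a
# typical flagship phone, while a 7B model at INT4 (~3.5 GB) can just about fit.
```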

Software is lagging the silicon

Even where the hardware is ready, the software often is not. Adapting today’s neural networks to take full advantage of NPU capabilities requires specialized knowledge and development effort, and that expertise is still a niche skill. On iOS, Apple has tried to smooth that path with Core ML, its flagship framework for machine learning, which bundles tools that convert and optimize models for Apple devices. But even with Core ML, developers still have to redesign networks, prune parameters and juggle precision to hit mobile constraints.
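
For a sense of what that path looks like in practice, here is a minimal conversion sketch using coremltools. The model choice, input shape and deployment target are illustrative assumptions, not a recommendation, and real projects usually add quantization and per-layer tuning on top.

```python
# Minimal sketch of converting a PyTorch model to Core ML with coremltools.
# The model, input shape and deployment target below are illustrative choices.
import torch
import torchvision
import coremltools as ct

model = torchvision.models.mobilenet_v3_small(weights="DEFAULT").eval()
example = torch.rand(1, 3, 224, 224)           # assumed input shape
traced = torch.jit.trace(model, example)       # Core ML conversion needs a traced graph

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=example.shape)],
    compute_precision=ct.precision.FLOAT16,    # shrink weights for mobile
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("MobileNetV3Small.mlpackage")
# Whether the Neural Engine actually executes each layer is decided by the OS at
# load time, which is part of why tuning for the NPU still takes manual iteration.
```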

Across platforms, the most successful AI infrastructure leans on specialized accelerators like TPUs or GPUs and highly optimized frameworks such as JAX, TensorFlow and PyTorch that maximize hardware utilization. Phones are only starting to get that level of tooling, and the fragmentation between Android vendors and iOS makes it harder for app makers to ship one well‑tuned on‑device model everywhere. Until the software ecosystem catches up, the NPU inside your phone will often idle while the CPU or GPU does work it could handle more efficiently.

Compression, quantization and the quality trade‑off

To squeeze big models into small devices, engineers lean on compression tricks that come with real trade‑offs. Quantization, which reduces the precision of weights and activations, is one of the most important, but research is blunt about its limits: quantized models can lose accuracy relative to full‑precision models as precision decreases, and while newer methods try to learn better step sizes, preserving accuracy under aggressive compression remains a challenge.
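
A toy illustration of the trade-off, using a single per-tensor scale, is below; production schemes use per-channel scales, calibration data and sometimes learned step sizes, but the direction of the effect is the same.

```python
import numpy as np

# Toy post-training quantization with one symmetric scale for the whole tensor.
rng = np.random.default_rng(0)
weights = rng.normal(0, 0.05, size=4096).astype(np.float32)   # stand-in FP32 weights

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for INT8, 7 for INT4
    scale = np.abs(w).max() / qmax              # single per-tensor step size
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return (q * scale).astype(np.float32)       # values the quantized model would use

for bits in (8, 4, 2):
    err = np.abs(weights - fake_quantize(weights, bits)).mean()
    print(f"INT{bits}: mean absolute weight error {err:.5f}")

# Error climbs sharply as precision drops, which is the accuracy-versus-size
# trade-off described above; aggressive compression is where quality suffers most.
```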

That is why some on‑device assistants feel a generation behind their cloud cousins: they are, in a very literal sense, smaller and blurrier versions of the same brains. On‑device processing must work within the power and thermal limits of mobile devices, which restricts the size and complexity of models compared with AI systems running on dedicated server farms. The result is a constant balancing act between responsiveness, battery life and output quality, and for now, many phone makers are choosing to keep models small enough that they rarely wow users in the way cloud tools can.

Why on‑device AI still matters

Despite those limits, the industry is not chasing on‑device AI as a vanity project. Running models locally can be a game‑changer for latency and reliability, which is why Samsung’s chip division argues that integrating on‑device AI is transformative for mobile technology, minimizing lag and improving user experiences. That matters for features like camera processing, live translation and voice commands, where even a half‑second round trip to the cloud can make the experience feel sluggish.

Privacy is the other major driver. Better privacy and security are central selling points for AI PCs, where a powerful NPU lets AI processing happen locally so sensitive data never leaves the device and personal information stays under the user’s control. The same logic applies to phones: processing data locally means sensitive information never leaves the handset, which addresses growing privacy concerns and regulatory pressure. For health apps, financial tools and workplace messaging, that promise can matter more than having the flashiest generative model.

Edge AI is rising, but the cloud still dominates

For years, AI discourse has been dominated by massive cloud‑based models trained on enormous datasets and running in centralized data centers, and that gravitational pull has shaped what users expect from “AI.” Even as companies talk up edge computing, the biggest breakthroughs still arrive as cloud services that phones tap into over a network. Analysts describe intelligence as only now starting to spread from that centralized model out to the edge.

At the same time, the economics of cloud AI are daunting. Training the most advanced models is extraordinarily expensive and time‑consuming, costing tens or hundreds of millions of dollars, which makes jurisdiction‑specific variants of the largest systems economically infeasible. Even for models already trained, deploying them in the cloud can be challenging, and aligning them with business objectives is time‑consuming and compute‑intensive. That cost pressure is one reason vendors are so eager to offload some inference to phones, even if the user experience is not yet on par with the cloud.

Developers are still figuring out how to use NPUs

From an app maker’s perspective, on‑device AI is not a free upgrade; it is a design problem. On‑device processing keeps everything local, which is faster and works offline, but the phone’s processor has to handle all the heavy lifting, as developers discover when they try to bolt a chatbot onto a messaging app. The optimization work is also time‑consuming and labor‑intensive, often requiring several rounds of redesign and redeployment, as researchers note when they tune deep networks for constrained systems.

That complexity is compounded by the NPU’s role in the device: it is designed to provide a base level of AI support extremely efficiently, but it is generally limited to small, persistent workloads that still need long battery life, as one analysis puts it. That makes NPUs perfect for always‑on tasks like wake‑word detection or background photo enhancement, but less suited to the bursty, heavyweight workloads of a full generative model. Until developer tools make it easier to split work intelligently between the CPU, GPU and NPU, many apps will stick to simpler, more predictable uses of on‑device AI.
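
As a sketch of the decision app developers currently make by hand, here is a hypothetical dispatcher. The thresholds, workload numbers and the very idea of picking a compute unit in application code are invented for illustration; real placement is decided by each vendor's runtime and drivers.

```python
# Hypothetical dispatcher (not a real vendor API) showing the kind of placement
# decision developers currently reason through by hand; thresholds are invented.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    ops: float               # approximate operations per inference
    always_on: bool          # runs continuously in the background?
    latency_budget_ms: float

def pick_unit(w: Workload) -> str:
    if w.always_on or w.ops < 1e10:
        return "NPU"         # small or persistent tasks: the NPU's efficiency sweet spot
    if w.latency_budget_ms >= 1000:
        return "GPU"         # bursty, heavyweight generative work the user waits on
    return "CPU"             # short, irregular work where dispatch overhead dominates

for w in [
    Workload("wake-word detection", 5e7, True, 20),
    Workload("background photo enhancement", 5e9, False, 200),
    Workload("LLM reply draft", 5e12, False, 2000),
]:
    print(f"{w.name} -> {pick_unit(w)}")
```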

Local AI is improving, just not evenly

Despite the frustrations, there are signs that on‑device intelligence is quietly getting better. Artificial intelligence is shifting from the cloud toward on‑device platforms in search of better reliability, more privacy and lower latency, and researchers have already built pipelines like Fontnet that run on resource‑constrained mobile hardware. In practice, that shift shows up in features like offline transcription, on‑device spam filtering and camera modes that no longer need a network connection to work.

The broader local AI ecosystem is also maturing. Analysts argue that the future of local AI is bright, driven by advances in dedicated hardware and privacy‑preserving technologies that make it a fundamental component of modern digital products. The landscape is evolving quickly, and while today’s tools may feel a bit raw, experts expect them to be dramatically better within a year. That uneven pace helps explain the user experience: some features, like photo processing on a 2025 flagship, feel magical, while others, like on‑device chat, still lag.

Hybrid AI is the realistic endgame

Given all these constraints, the most plausible future is not a phone that replaces the cloud, but a partnership between the two. The most likely outcome? A hybrid future where AI and human models coexist, with AI systems dominating certain sectors while humans retain a stronghold in areas requiring authenticity and relatability. The same hybrid logic is emerging in infrastructure: with the advanced capabilities of on‑device AI, a hybrid architecture can scale to meet enterprise and consumer needs, providing better cost, energy, performance, privacy, security and personalization than either side alone, as one white paper on the hybrid AI future puts it.

Phone platforms are already moving in that direction. Apple’s latest strategy gives developers free access to on‑device large language models, and analysts expect that move to accelerate the trend of hybrid AI architectures, where on‑device models handle common tasks and cloud models are reserved for more complex or data‑intensive queries. In parallel, PC makers explain that on‑device AI, in relation to NPUs, means machine learning tasks running locally on a device’s NPU instead of in the cloud, handling things like image or audio processing directly and instantly. As that pattern spreads to phones, the most satisfying AI experiences are likely to be the ones users barely notice, where the NPU quietly handles the routine and the cloud steps in only when it truly adds something new.
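
A hypothetical sketch of that routing pattern is below. The complexity heuristic, threshold and model calls are placeholders for illustration, not any platform's real API.

```python
# Hypothetical hybrid router; the heuristic, threshold and model calls are
# placeholders, not any platform's real API.
def estimate_complexity(prompt: str) -> float:
    # Toy heuristic: long prompts or analysis-style requests go to the cloud.
    score = len(prompt.split()) / 50
    if any(k in prompt.lower() for k in ("analyze", "write code", "summarize this document")):
        score += 1.0
    return score

def run_on_device(prompt: str) -> str:     # placeholder for a local, on-NPU model call
    return f"[local model] {prompt[:40]}..."

def run_in_cloud(prompt: str) -> str:      # placeholder for a hosted API call
    return f"[cloud model] {prompt[:40]}..."

def answer(prompt: str, threshold: float = 1.0) -> str:
    if estimate_complexity(prompt) < threshold:
        return run_on_device(prompt)       # private, instant, works offline
    return run_in_cloud(prompt)            # slower and costlier, but more capable

print(answer("Set a timer for 10 minutes"))
print(answer("Analyze this quarterly report and draft a summary for the board"))
```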

The hype cycle is ahead of the user experience

There is also a timing problem. Investors and marketers sprinted ahead of the technology, promising a revolution before the application layer was ready. Analysts now argue that, as with most new technologies, it takes time for generative AI to mature and reach application-layer development and distribution. On phones, that lag shows up as a mismatch between the bold “AI phone” branding and the relatively incremental features that ship at launch.

Some critics worry that the imbalance of power in AI will only deepen as the biggest players control both the cloud models and the chips in people’s pockets. Author Gary Rivlin has warned that building these models concentrates enormous power in the hands of a few people, a concern he raised while discussing large AI systems and deregulation. For now, though, the more immediate frustration for users is simpler: phone NPUs really are getting better, but until the physics, software and business incentives line up, the AI they enable will keep feeling like a work in progress rather than the leap the marketing promises.