Taalas, an AI chip startup, has reportedly moved away from NVIDIA GPUs in favor of hardwired AI chips, claiming inference speeds of 17,000 tokens per second. The shift coincides with a broader industry push toward specialized silicon for AI workloads, highlighted by Cerebras Systems tripling its own inference performance to set a new speed record. Together, these developments signal that the GPU’s long reign as the default AI accelerator faces a serious challenge from purpose-built hardware.
Cerebras Sets a New Inference Speed Record
The clearest benchmark for this hardware transition comes from Cerebras Systems, which announced it had tripled its inference performance, reaching 2,100 tokens per second on Meta’s Llama 3.1 70B model. That figure was independently verified by Artificial Analysis, a third-party organization that benchmarks inference throughput across providers. By leaning on an outside evaluator rather than self-reported numbers, Cerebras gave its claim a verifiable foundation, making it the fastest publicly benchmarked inference result for a model of that size at the time of the announcement and raising the bar for transparency in performance marketing.
The 2,100 tokens-per-second figure matters because it represents a threefold jump from the company’s prior record, not an incremental gain. Tokens are the basic units of text that large language models process, and higher throughput translates directly into faster responses for end users and more efficient utilization of hardware. At that speed, a system could generate a full page of text in roughly one second, a pace that makes real-time conversational AI, code generation, and high-volume document processing far more practical. The result also reframes the competitive conversation: GPU-based inference clusters from major cloud providers have generally operated well below that threshold for models in the 70-billion-parameter class, which means specialized chips are no longer just theoretical alternatives but measurable leaders in raw speed for certain workloads.
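The page-per-second framing can be checked with back-of-envelope arithmetic. The conversion factors below (roughly 0.75 words per token and 500 words per page) are common rules of thumb, not vendor-published figures:

```python
# Rough illustration of what these throughput numbers mean for end users.
# Assumes ~0.75 words per token and ~500 words per page, both of which
# are rules of thumb rather than measured values.

WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500

def seconds_per_page(tokens_per_second: float) -> float:
    """Time to generate one full page of text at a given throughput."""
    tokens_per_page = WORDS_PER_PAGE / WORDS_PER_TOKEN  # ~667 tokens
    return tokens_per_page / tokens_per_second

print(f"2,100 tok/s:  {seconds_per_page(2100):.2f} s per page")
print(f"17,000 tok/s: {seconds_per_page(17000):.2f} s per page")
```

At 2,100 tokens per second this works out to roughly a third of a second per page, comfortably inside the "full page in roughly one second" pace described above.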
Why Taalas Chose Hardwired Silicon Over GPUs
Taalas reportedly took the specialized-hardware argument a step further by building its inference pipeline around hardwired AI chips rather than general-purpose GPUs. The company’s claimed throughput of 17,000 tokens per second, if accurate, would represent a dramatic leap beyond even the Cerebras benchmark and suggest an order-of-magnitude advantage over conventional GPU clusters for similar model sizes. However, no primary press release from Taalas or independent third-party verification of that specific figure has surfaced in available reporting. Without confirmation from an organization like Artificial Analysis or an equivalent benchmarking group, the 17,000-token claim should be treated with caution, and the exact hardware configuration behind it remains unclear based on current public sources.
What is clear is the logic driving the switch. GPUs were originally designed for graphics rendering and were later adapted for AI training and inference through software frameworks like CUDA and ROCm. That flexibility comes at a cost: higher power consumption, memory bandwidth bottlenecks, and the overhead of using general-purpose silicon for a narrow set of operations. Hardwired chips, by contrast, are designed from the transistor level to execute specific AI math, particularly the matrix multiplications and attention mechanisms that dominate inference workloads. The tradeoff is reduced versatility (these chips are less suitable for arbitrary computation or rapid model-architecture experimentation), but for companies whose entire business depends on serving AI predictions at scale, that tradeoff can be worth it. Taalas appears to have bet that inference speed, predictability, and energy efficiency matter more than the ability to retrain or radically reconfigure models on the same hardware stack.
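The "specific AI math" in question is dominated by a few dense linear-algebra operations. As a purely illustrative sketch (the function names and the tiny 2x2 matrices are hypothetical, not tied to any vendor's hardware), scaled dot-product attention reduces to two matrix multiplications wrapped around a softmax, exactly the kind of fixed computation pattern a hardwired chip can commit to silicon:

```python
# Pure-Python sketch of the core computation a hardwired inference chip
# optimizes: the matrix multiplications inside scaled dot-product
# attention. Shapes and values are illustrative only.
import math

def matmul(A, B):
    """Multiply an (m x k) matrix by a (k x n) matrix (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(Q[0])
    KT = [list(col) for col in zip(*K)]
    scores = matmul(Q, KT)                       # matmul 1: similarities
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)                    # matmul 2: weighted values

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 3.0]]
print(attention(Q, K, V))
```

In production, these matrices are thousands of dimensions wide and the same two multiplications repeat for every layer and every token generated, which is why a chip that does nothing but this pattern can afford to strip out everything else.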
The Economics Pushing Companies Beyond GPUs
The financial pressure behind these hardware decisions is straightforward. As generative AI models grow larger and demand for real-time inference climbs, the cost of running GPU clusters has become a significant line item for AI companies. High-end data center GPUs carry steep acquisition costs, and the electricity required to run and cool them adds an ongoing operational burden that compounds as usage scales. For a mid-tier firm like Taalas, which does not have the capital reserves or volume discounts of a hyperscaler, finding a cheaper path to high throughput is not just an engineering preference but a survival strategy. Every marginal improvement in tokens per second per dollar of hardware can translate into better margins, more competitive pricing, or both.
Hardwired chips offer a potential escape from that cost spiral. Because they eliminate much of the transistor overhead associated with general-purpose computing, they can deliver more operations per watt, which directly reduces electricity bills and cooling requirements. Cerebras has built its approach around a single wafer-scale chip that avoids the communication latency and energy overhead of linking multiple smaller GPUs together, a design choice that contributes to the speed gains reflected in its benchmark results. For companies deploying AI at the edge, where power and space constraints are even tighter, purpose-built silicon could unlock applications, such as on-device copilots or real-time analytics, that are simply impractical with GPU racks. The economic argument, in short, is not just about peak speed but about the total cost of delivering each token to an end user over the lifetime of a deployment.
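The tokens-per-second-per-dollar logic can be made concrete with a simple amortization model. Every number below is a hypothetical placeholder (hardware price, power draw, electricity rate, deployment lifetime), chosen only to show how raw throughput dominates per-token cost when the other inputs are held equal:

```python
# Hypothetical cost-per-token comparison. All prices, power figures,
# and throughputs here are made-up placeholders, not vendor data.

def cost_per_million_tokens(hw_cost, lifetime_years, power_kw,
                            price_per_kwh, tokens_per_second):
    """Amortized hardware + electricity cost per 1M generated tokens."""
    seconds = lifetime_years * 365 * 24 * 3600
    total_tokens = tokens_per_second * seconds
    energy_cost = power_kw * (seconds / 3600) * price_per_kwh
    return (hw_cost + energy_cost) / total_tokens * 1_000_000

# Same placeholder system cost, lifetime, and power price; only the
# sustained throughput differs between the two scenarios.
gpu_cluster = cost_per_million_tokens(300_000, 4, 10.0, 0.10, 400)
asic_system = cost_per_million_tokens(300_000, 4, 10.0, 0.10, 2_100)
print(f"GPU cluster: ${gpu_cluster:.2f}/M tokens")
print(f"ASIC system: ${asic_system:.2f}/M tokens")
```

Under these assumptions the faster system delivers tokens at a fraction of the cost, even with identical hardware and energy spend, which is the whole of the economic case in miniature.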
What Independent Benchmarks Reveal and Conceal
The role of third-party measurement in this competition deserves scrutiny. Cerebras chose to have its results validated by Artificial Analysis, which lends credibility to the 2,100-token figure and sets a precedent for how performance claims can be substantiated. Independent benchmarking organizations apply standardized testing conditions, controlling for variables like batch size, model precision, prompt length, and network latency that vendors might otherwise optimize around in self-reported numbers. That kind of transparency is still relatively rare in an industry where performance claims often come with asterisks, footnotes, or narrowly defined scenarios that do not match real-world usage.
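The discipline of holding those variables fixed is the heart of independent benchmarking. The sketch below is a toy harness, not Artificial Analysis's actual methodology: `fake_generate` is a stand-in for a real inference API, and the timing numbers are simulated, but the structure (fixed prompt length, fixed output length, repeated runs, a median to resist outliers) is the general shape such tests take:

```python
# Toy throughput benchmark illustrating why fixed test conditions matter.
# `fake_generate` is a hypothetical stand-in for a real inference call.
import time

def fake_generate(prompt_tokens: int, output_tokens: int) -> int:
    """Placeholder model call: pretend each output token takes 0.5 ms."""
    time.sleep(output_tokens * 0.0005)
    return output_tokens

def benchmark(runs=5, prompt_tokens=128, output_tokens=256):
    """Median tokens/second over several runs with identical settings."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        produced = fake_generate(prompt_tokens, output_tokens)
        elapsed = time.perf_counter() - start
        rates.append(produced / elapsed)
    rates.sort()
    return rates[len(rates) // 2]  # median is robust to one slow run

print(f"{benchmark():.0f} tokens/s (simulated)")
```

Change any of those parameters, batch more requests together, or lower the model's numeric precision, and the headline number moves, which is exactly why self-reported figures with unstated settings are hard to compare.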
Yet benchmarks also have limits. A tokens-per-second figure for a single model under controlled conditions does not capture the full picture of production readiness. Real-world inference involves handling thousands of simultaneous requests, managing variable-length inputs, enforcing rate limits, and maintaining consistent latency under load while sharing resources with other services. A chip that excels on a clean benchmark run may behave differently when serving a live application with unpredictable traffic patterns, noisy neighbors, and complex orchestration layers. The absence of independent verification for Taalas’s claimed 17,000-token throughput is a gap that matters precisely because benchmarks are where marketing claims get tested. Until a recognized third party confirms that number under controlled conditions, it remains an assertion rather than a proven result, and buyers will have to weigh that uncertainty against the more rigorously documented performance of competitors like Cerebras.
A Market Split Between Flexibility and Raw Speed
The broader takeaway from these developments is that the AI hardware market is splitting into distinct lanes rather than converging on a single dominant architecture. GPUs will likely remain the default choice for training large models, where their programmability, mature software ecosystem, and compatibility with a wide range of research workflows provide clear advantages. For organizations experimenting with new architectures, modalities, or training techniques, the ability to repurpose the same GPU clusters for many different tasks is hard to beat. Cloud providers also benefit from the fungibility of GPU capacity, which can be dynamically allocated among customers and workloads.
For inference, however, the task is narrower and the economics favor efficiency over flexibility, giving specialized chips room to carve out serious ground. Cerebras has demonstrated that purpose-built silicon can outperform GPU-based systems by a wide margin on standardized benchmarks, and companies like Taalas are placing real bets on that advantage translating into commercial viability. Over the next few years, enterprises may find themselves choosing between a flexible, GPU-centric stack that simplifies training and experimentation, and a heterogeneous architecture that pairs GPUs for training with hardwired accelerators for high-volume inference. In that scenario, the winners are likely to be vendors that can prove their numbers under independent scrutiny and customers that can accurately model not just peak performance, but the long-term cost and reliability of every token their systems generate.
*This article was researched with the help of AI, with human editors creating the final content.*