Morning Overview

OpenAI hires startup Gimlet Labs to optimize its models for Cerebras chips — claiming 10x faster AI inference at the same cost

A startup called Gimlet Labs says it can split AI workloads across chips from different manufacturers and make inference up to 10 times faster without increasing cost or power consumption. On March 23, 2026, the company announced $80 million in Series A funding led by M12, Microsoft’s venture arm, and named OpenAI among the frontier labs using its technology to run models on hardware from Cerebras and other vendors.

If the performance numbers hold up, Gimlet could offer large AI companies something they have been chasing for years: a way to break free from dependence on a single chip supplier, primarily NVIDIA, without sacrificing speed.

What Gimlet Labs actually built

Gimlet’s software acts as an orchestration layer that sits between AI models and the physical chips running them. Instead of compiling a model for one type of processor, the system decomposes it into subgraphs and routes each piece to whichever accelerator handles it most efficiently. A matrix-heavy layer might land on a GPU optimized for that math, while a memory-intensive operation could be sent to a Cerebras wafer-scale engine, a processor built from an entire silicon wafer rather than a small die cut out of one.
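
To make the idea concrete, here is a toy sketch in Python of the kind of routing decision such a layer makes. Everything in it is hypothetical: the subgraph profiles, the threshold, and the device names are illustrative stand-ins, not Gimlet’s actual system.

```python
# Toy sketch of subgraph routing across heterogeneous accelerators.
# All profiles, thresholds, and device names are hypothetical illustrations,
# not Gimlet Labs' actual system.

from dataclasses import dataclass

@dataclass
class Subgraph:
    name: str
    gflops: float   # estimated compute for one pass
    gbytes: float   # estimated memory traffic for one pass

def arithmetic_intensity(sg: Subgraph) -> float:
    """FLOPs per byte moved. High values suit compute-rich GPUs;
    low values suit chips with very high memory bandwidth."""
    return sg.gflops / sg.gbytes

def route(subgraphs: list[Subgraph], threshold: float = 10.0) -> dict[str, str]:
    """Assign compute-bound subgraphs to a GPU and memory-bound ones
    to a bandwidth-rich accelerator such as a wafer-scale engine."""
    return {
        sg.name: "gpu" if arithmetic_intensity(sg) >= threshold else "wafer_scale"
        for sg in subgraphs
    }

model = [
    Subgraph("attention_matmul", gflops=480.0, gbytes=12.0),  # compute-heavy
    Subgraph("kv_cache_gather", gflops=2.0, gbytes=8.0),      # memory-heavy
]
print(route(model))
# {'attention_matmul': 'gpu', 'kv_cache_gather': 'wafer_scale'}
```

The heuristic here, arithmetic intensity, is a standard roofline-model measure; a production scheduler would also have to weigh transfer costs between chips and the current load on each device.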

The technical underpinnings are described in a preprint paper titled “Efficient and Scalable Agentic AI with Heterogeneous Systems,” authored by Gimlet co-founders Zain Asgar and Sachin Katti alongside Michelle Nguyen and posted on arXiv. The paper outlines three core mechanisms: MLIR-based compilation that breaks models into hardware-optimized fragments, cost models that predict which processor best suits each fragment, and dynamic placement strategies that adjust routing in real time based on telemetry from the running system.
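
Those three mechanisms can be sketched in miniature. The snippet below is one reading of what the paper describes, not code from it: the roofline-style latency estimate, the queue-depth penalty, and all device figures are assumptions made for illustration.

```python
# Sketch of the paper's cost-model-plus-telemetry loop in miniature.
# The latency model, penalty factor, and device numbers are illustrative
# assumptions, not figures from the paper.

def predicted_latency_ms(fragment: dict, device: dict) -> float:
    """Roofline-style cost model: a fragment is limited by whichever is
    slower, its compute time or its memory-transfer time, on a device."""
    compute_ms = fragment["gflops"] / device["gflops_per_ms"]
    memory_ms = fragment["gbytes"] / device["gbytes_per_ms"]
    return max(compute_ms, memory_ms)

def place(fragment: dict, devices: dict, queue_depth: dict) -> str:
    """Dynamic placement: pick the device whose predicted latency,
    penalized by live queue depth from telemetry, is lowest right now."""
    def adjusted(name: str) -> float:
        return predicted_latency_ms(fragment, devices[name]) * (1 + 0.1 * queue_depth[name])
    return min(devices, key=adjusted)

devices = {
    "gpu": {"gflops_per_ms": 60.0, "gbytes_per_ms": 2.0},   # compute-rich
    "wse": {"gflops_per_ms": 40.0, "gbytes_per_ms": 20.0},  # bandwidth-rich
}
fragment = {"gflops": 120.0, "gbytes": 30.0}  # a memory-bound fragment

# Even with four requests queued on the bandwidth-rich device, its memory
# advantage keeps it the better home for this fragment.
print(place(fragment, devices, queue_depth={"gpu": 0, "wse": 4}))  # -> 'wse'
```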

MLIR, a compiler framework originally developed at Google and now maintained as part of the LLVM project, gives the approach a credible technical foundation. Using MLIR is not novel on its own, but applying it to orchestrate inference across fundamentally different chip architectures at production scale would be a meaningful engineering achievement. The arXiv paper, it is worth noting, has not undergone formal peer review.

The founders behind the pitch

Gimlet’s leadership brings heavyweight credentials. Zain Asgar previously worked at Google on infrastructure projects and has ties to Stanford’s AI research ecosystem. Sachin Katti is a Stanford computer science professor who also served as a vice president at Cisco, giving him experience bridging academic research and large-scale commercial systems. That combination of deep technical knowledge and industry connections likely helped attract the $80 million round and the attention of frontier labs.

The funding itself is notable in context. Cerebras, whose chips Gimlet claims to support, has raised over $4 billion and pursued an IPO. NVIDIA’s inference software stack, TensorRT-LLM, is the dominant tool most AI companies use today. For a Series A startup to position itself as the layer that sits above both ecosystems is ambitious, and the investor backing suggests at least some sophisticated backers believe the technical approach can work.

What OpenAI’s involvement actually looks like

The nature of OpenAI’s relationship with Gimlet Labs remains only partially clear. Gimlet’s press release lists OpenAI among the companies using its technology, but OpenAI has not issued its own statement confirming the arrangement, its scope, or its terms. There is no public indication of whether this involves a formal contract, a pilot program, equity, or something else entirely.

Similarly, Cerebras has not released a public statement verifying that its wafer-scale chips have been tested with or integrated into Gimlet’s orchestration software. The claim that Gimlet optimizes models specifically for Cerebras hardware comes from Gimlet’s own characterization, not from a joint announcement or independent validation.

This does not mean the relationships are fabricated. Startups routinely reference customers in funding announcements with those customers’ permission. But readers should understand that the specifics of how deeply OpenAI has adopted Gimlet’s tools, and on what hardware, are not yet documented in any public source beyond Gimlet’s own materials.

The 10x claim deserves a closer look

Gimlet’s press release states its software delivers “3 to 10X faster speed for the same cost and power envelope.” That is a wide range, and the company has not published the conditions under which the upper bound applies. The arXiv paper provides the theoretical methods but does not include benchmarks showing a 10x speedup on any specific model running in a production environment.

AI inference performance is notoriously sensitive to variables: model architecture, batch size, sequence length, memory bandwidth, network latency between processors, and scheduling overhead all play a role. A framework that achieves strong results in controlled experiments may behave differently when applied to OpenAI’s largest reasoning models at production scale, where tail latency and uptime requirements often matter more than peak throughput.
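
A small simulation makes the tail-latency point concrete. The numbers below are invented, but they show how a system can look fast on average while its 99th-percentile latency, the figure production service-level targets care about, tells a different story.

```python
# Invented numbers, real phenomenon: a service that is fast on average
# can still miss production latency targets at the tail.

import random
import statistics

random.seed(0)

# 98% of requests take ~50 ms; 2% hit a slow path (say, a cross-chip
# scheduling stall) and take ~400 ms.
latencies_ms = sorted(
    random.gauss(50, 5) if random.random() > 0.02 else random.gauss(400, 50)
    for _ in range(10_000)
)

mean = statistics.fmean(latencies_ms)
p99 = latencies_ms[int(0.99 * len(latencies_ms))]
print(f"mean: {mean:.1f} ms, p99: {p99:.1f} ms")
# The p99 lands several times above the mean, which is why a single
# headline speedup number says little about production behavior.
```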

Independent benchmarks or third-party audits would go a long way toward validating the claim. Until those appear, the 3-to-10x figure should be understood as a company-reported number, not an independently verified one.

Why this matters beyond one startup

The bigger story here is not whether Gimlet Labs specifically delivers on every promise. It is whether the entire approach of heterogeneous inference, running a single AI model across chips from multiple vendors simultaneously, can become practical at scale.

Today, most frontier AI labs are deeply tied to NVIDIA’s ecosystem. The company’s GPUs, networking hardware, and software tools form an integrated stack that is difficult and expensive to leave. That concentration gives NVIDIA enormous pricing power and makes AI infrastructure costs one of the largest line items for companies like OpenAI, Anthropic, and Google DeepMind.

If orchestration software can genuinely abstract away the hardware layer, the competitive dynamics shift. Labs could mix and match GPUs, Cerebras wafer-scale engines, and future accelerators from companies like AMD, Intel, or custom chip designers. More hardware options would mean more price competition, which should drive down the per-query cost of running AI models.

For smaller AI companies and enterprise adopters, the stakes are different but related. Organizations that cannot afford to fill data centers with NVIDIA’s top-tier H100 or B200 GPUs could assemble mixed hardware clusters and still achieve competitive inference speeds. That would lower the barrier to deploying large models in production.

What independent verification would need to show

The evidence available as of June 2026 supports the technical plausibility of Gimlet Labs’ approach but stops short of validating its most aggressive claims. The funding confirms that serious investors are backing the vision. The arXiv paper demonstrates that the underlying ideas have been worked through with real technical depth. And the mention of OpenAI as a customer, even without full independent confirmation, signals that at least one major lab is engaged.

What will separate this from vaporware is what comes next: published benchmarks on named models and named hardware, customer case studies with verifiable metrics, and ideally third-party evaluations. The push to decouple AI software from any single chip platform is real and growing, but its success will be measured in production results, not press releases.

*This article was researched with the help of AI, with human editors creating the final content.