NVIDIA’s $20 billion move for the Groq LPU has reshaped its AI inference roadmap, from GPU architecture choices to how Vera Rubin will serve real-time workloads at scale.
Inside NVIDIA’s $20 billion bet on Groq
On Christmas Day 2025, NVIDIA’s acquisition of Groq was still just a phone call and a handshake. Founder Jonathan Ross — creator of both the LPU and Google’s original TPU — was at home when discussions began. Three weeks later, $20 billion had changed hands and Ross had become NVIDIA’s Chief Software Architect.
The process started when Groq COO Sunny Madra contacted Jensen Huang after NVIDIA opened NVLink to ecosystem partners, asking if Groq could connect to it. Huang replied: “Sure, why not?” They tested GPU–LPU workload splitting over Ethernet — no NVLink yet, just a proof of concept. It worked reliably enough that, within days of the demo, formal acquisition talks began.
The speed of the deal reflected an urgent strategic problem. NVIDIA bought a targeted answer to the decode bottleneck every GPU datacenter faces in LLM inference. Moreover, it aligned with NVIDIA’s broader GPU architecture evolution from H100 to Blackwell B200 and then to the Vera Rubin platform.
Why GPUs struggle with LLM token generation
Every LLM inference request has two distinct phases. Prefill is when the model ingests and processes the full input — prompt, context, or document — in parallel. Thousands of tokens, all attention heads firing at once. GPUs excel at this highly parallel workload.
Decode, however, is fundamentally different. The model generates output one token at a time, sequentially. Each new token depends on all previous tokens, so the process cannot be parallelized across the sequence. This sequential nature exposes the architectural mismatch between GPUs and token-by-token generation.
During each decode step, the GPU pulls the entire weight matrix from HBM memory, performs a relatively small matrix multiply, and outputs a single token. The weights do not change between steps, so the same data is loaded repeatedly. Even at 22 TB/s HBM4 bandwidth on a Vera Rubin GPU, memory movement becomes the bottleneck because compute per step is so light.
GPUs are optimized for massive parallelism, not fine-grained sequential execution. Consequently, each token-generation step leaves a large portion of GPU compute underutilized. This is exactly the problem the Groq 3 LPU was designed to solve, by taking over the part of inference that is bound by memory access and latency, not raw FLOPS.
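The bandwidth argument above can be made concrete with back-of-envelope math. At batch size 1, a decode step cannot finish faster than the time it takes to stream every weight byte from memory. A minimal sketch, assuming a hypothetical 70B-parameter dense model in FP8 (1 byte per parameter) and ignoring KV-cache reads and compute entirely:

```python
# Lower-bound decode-step timing for a memory-bound accelerator, using the
# bandwidth figures quoted in this article. The 70B model size is an
# illustrative assumption, not a spec from the source.

def decode_step_floor_s(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Minimum time for one decode step: every weight byte crosses memory once."""
    return weight_bytes / bandwidth_bytes_per_s

WEIGHTS = 70e9      # 70B params in FP8 -> ~70 GB of weights (assumption)
HBM4_BW = 22e12     # 22 TB/s, Vera Rubin HBM4 (per article)
SRAM_BW = 150e12    # 150 TB/s, LP30 on-die SRAM (per article)

t_hbm = decode_step_floor_s(WEIGHTS, HBM4_BW)
t_sram = decode_step_floor_s(WEIGHTS, SRAM_BW)
print(f"HBM4 floor: {t_hbm*1e3:.2f} ms/token -> at most {1/t_hbm:.0f} tok/s")
print(f"SRAM floor: {t_sram*1e3:.2f} ms/token -> at most {1/t_sram:.0f} tok/s")
```

The point of the sketch is the ratio, not the absolute numbers: the single-request token-rate ceiling scales directly with memory bandwidth, which is why the roughly 7x bandwidth gap matters more than FLOPS for decode.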
Groq 3 LP30 design and SRAM-first architecture
The LP30 chip, at the heart of Groq 3, is architecturally very different from a GPU. Instead of high-bandwidth HBM sitting beside the die, it integrates 500 MB of on-die SRAM delivering 150 TB/s of bandwidth. A Vera Rubin GPU, by contrast, offers 288 GB of HBM4 at 22 TB/s. The LPU has far less capacity, but its memory access is almost 7x faster.
This design is deliberate. There is no cache hierarchy, no dynamic hardware scheduling and no speculative execution. Instead, the LP30 exposes flat SRAM with physical bank addressing. Moreover, it replaces the GPU’s runtime-driven execution model with compiler-driven static instruction scheduling, where every instruction’s timing is fixed at compile time.
That deterministic approach eliminates variance in token delivery timing, which is crucial for real-time voice agents and latency-sensitive applications. A GPU generating tokens sequentially experiences unpredictable microsecond-level delays between tokens, driven by memory pressure and scheduler state. The LPU avoids this jitter entirely by construction.
The LP30 includes four distinct execution unit types: Matrix (MXM) for dense FFN matrix multiplications, Vector (VXM) for pointwise operations and activations, Switch (SXM) for data permutation and transposition, and Memory (MEM) for direct SRAM addressing under compiler control. This mix is optimized for transformer FFN execution.
Each LP30 chip delivers 1.2 PFLOPS of FP8 compute. It connects to neighboring chips via 96 chip-to-chip links at 112 Gbps each, for a total of 2.5 TB/s of C2C bandwidth per chip. Built on Samsung’s LP4X process, it doubles SRAM capacity versus Groq’s first-generation chip, which had 230 MB.
Crucially, the LPU is not meant to replace GPUs. It is designed to sit alongside Vera Rubin GPUs in a split-execution model, with each processor type handling the part of the workload that matches its architecture. However, making that split efficient required a new system design.
Attention-FFN Disaggregation and system-level split
NVIDIA labels the architecture Attention-FFN Disaggregation (AFD). The core insight is that a transformer’s decoder layer contains two very different types of work, and different hardware wins at each: attention versus feed-forward networks.
Attention is context-dependent. It reads the KV cache — the buffer storing intermediate attention computations that grows linearly with sequence length — computes attention scores over all previous tokens, and produces intermediate activations. Compute scales with context length, so it needs hardware with large HBM capacity to hold the KV cache. That role falls to the Rubin GPU.
The FFN path, by contrast, is context-independent. The same weight matrices are used regardless of the specific prompt; only the incoming activations differ per token. Modern MoE models are dominated by FFN parameters: DeepSeek R1 has 669 billion FFN parameters out of 685 billion total, or 97.7%. Kimi K2 reaches 98.9%. These weights can fit in SRAM at rack scale, and loading them at 150 TB/s is vastly faster than 22 TB/s from HBM4.
In practice, the split works like this: a Vera Rubin NVL72 complex handles prefill, KV cache construction, and decode attention. The Groq 3 LPX system takes on decode FFN and MoE routing. For a model with 40 decoder layers, that implies 40 round trips per token, with activations flowing from GPU to LPU and back for each token generated.
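The per-token loop described above can be sketched schematically. Everything here is illustrative: the function names, data shapes, and interfaces are stand-ins, since the actual Dynamo/NVLink APIs are not public in this form. What the sketch captures is the control flow — one GPU-attention call and one LPU-FFN call per decoder layer, so 40 round trips per generated token:

```python
# Schematic of the Attention-FFN Disaggregation (AFD) decode loop.
# All names and types are hypothetical stand-ins for illustration.
from dataclasses import dataclass, field

@dataclass
class DecodeState:
    token: int                                  # last generated token (stub)
    kv_cache: list = field(default_factory=list)  # lives on the GPU side

def gpu_attention(layer: int, hidden, state: DecodeState):
    """Rubin GPU: attention over the KV cache; the cache grows every step."""
    state.kv_cache.append((layer, hidden))
    return hidden                               # placeholder for attn output

def lpu_ffn(layer: int, hidden):
    """Groq LPU: context-independent FFN/MoE, weights resident in SRAM."""
    return hidden                               # placeholder for FFN output

def decode_one_token(state: DecodeState, num_layers: int = 40):
    hidden = state.token                        # embedding lookup (stubbed)
    for layer in range(num_layers):
        hidden = gpu_attention(layer, hidden, state)  # activations GPU -> LPU
        hidden = lpu_ffn(layer, hidden)               # activations LPU -> GPU
        # each loop iteration is one GPU->LPU->GPU round trip
    return hidden                               # feeds sampling, then repeats
```

Only activations cross the link each round trip; the FFN weights never move, which is what makes the split viable despite 40 crossings per token.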
All of this is orchestrated by NVIDIA Dynamo, a software layer that classifies incoming requests, routes prefill to GPU workers, manages the per-token AFD loop, and performs KV-aware scheduling so new tokens land on workers that already hold the relevant cache. Moreover, it coordinates networking so cross-rack traffic is minimized.
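KV-aware scheduling, as described for Dynamo, reduces to a simple placement rule: prefer a worker that already holds the session’s KV cache, even over a less-loaded one, and fall back to load balancing only on a cache miss. A minimal sketch under that assumption (the worker model and field names here are hypothetical, not Dynamo’s actual API):

```python
# KV-aware routing sketch: cache locality beats load balance.
# The `workers` schema is an illustrative assumption.

def route(session_id: str, workers: dict) -> str:
    """Pick a worker for the next decode step of `session_id`."""
    holders = [w for w, info in workers.items()
               if session_id in info["kv_sessions"]]
    if holders:
        # Cache hit: among holders, still prefer the least-loaded one.
        return min(holders, key=lambda w: workers[w]["load"])
    # Cache miss: least-loaded worker overall; it must build the KV cache.
    return min(workers, key=lambda w: workers[w]["load"])

workers = {
    "gpu-0": {"kv_sessions": {"a"}, "load": 5},
    "gpu-1": {"kv_sessions": {"b"}, "load": 2},
}
print(route("a", workers))  # "gpu-0": cache hit wins despite higher load
print(route("c", workers))  # "gpu-1": miss, so least-loaded worker
```

The design choice this illustrates: re-deriving a long context’s KV cache on a fresh worker costs a full prefill pass, so locality is usually worth some load imbalance.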
As Jonathan Ross explained during his GTC 2026 session, “If you run everything on the LPU, you’d be underutilizing it on attention. If you run everything on the GPU, you underutilize it on the FFN layers. Putting them together, the utilization goes up for both.” That comment neatly summarizes the logic behind the NVIDIA-Groq acquisition.
In short, Rubin GPUs handle what demands large HBM — attention over long contexts and KV cache storage. Groq’s LPU handles what needs extreme memory bandwidth and deterministic timing — FFN weights delivered at 150 TB/s for every token.
Groq 3 LPX rack design and performance envelope
The AFD model only works if the Groq 3 LPX scales far enough to cover the FFN workload. At 256 LP30 chips, it reaches that scale. The LPX rack integrates 256 chips across 32 liquid-cooled compute trays, with 8 chips per tray in an all-to-all topology to keep intra-rack latency low.
Rack-level specifications are substantial. Total SRAM across the 256 chips is 128 GB. Aggregate chip-to-chip bandwidth reaches 640 TB/s. FP8 compute tops out at 315 PFLOPS. Inter-rack connectivity provides roughly 448 GB/s per direction per tray, using a Dragonfly topology with a maximum three-hop diameter.
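Most of these rack figures follow directly from the per-chip specs quoted earlier, which makes them easy to sanity-check. One caveat in the arithmetic below: the naive FP8 product (256 × 1.2 PFLOPS) lands just under the quoted 315 PFLOPS, so the rack rating presumably uses a slightly different per-chip figure:

```python
# Cross-checking Groq 3 LPX rack totals against per-LP30-chip specs
# quoted in the article.
CHIPS = 256
SRAM_MB_PER_CHIP = 500    # on-die SRAM per LP30
C2C_TBPS_PER_CHIP = 2.5   # chip-to-chip bandwidth per LP30
FP8_PFLOPS_PER_CHIP = 1.2

total_sram_gb = CHIPS * SRAM_MB_PER_CHIP / 1000
total_c2c_tbps = CHIPS * C2C_TBPS_PER_CHIP
total_fp8_pflops = CHIPS * FP8_PFLOPS_PER_CHIP

print(total_sram_gb)      # 128.0  -> matches the quoted 128 GB
print(total_c2c_tbps)     # 640.0  -> matches the quoted 640 TB/s
print(total_fp8_pflops)   # 307.2  -> close to, but below, the quoted 315
```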
When paired with Vera Rubin NVL72, this platform delivers dramatic efficiency gains. According to NVIDIA, the combined Vera Rubin NVL72 + LPX system achieves 35x higher inference throughput per megawatt than a Grace Blackwell NVL72 system alone. Importantly, the comparison is against Blackwell NVL72, not older H100 configurations.
During the GTC keynote, Jensen Huang showcased interactive targets of 800–1,000 tokens per second for real-time workloads with this combined system. However, one crucial clarification from the same event is that the LPX rack is not a cheaper alternative to NVL72.
When an audience member suggested that LPX might be less costly, Ross pushed back: “I don’t actually agree that it’s less costly. There’s a lot of silicon in that rack.” The LPX is framed as a premium extension to the Vera Rubin platform for customers who need ultra-low latency inference at scale, not as a budget inference card.
There is also a hard cap on model size. With 128 GB of SRAM, an LPX rack can host models up to roughly 53 GB in FP8. Larger models such as DeepSeek R1, with 623 GB worth of FFN weights in FP8, require chaining multiple LPX racks via front-panel C2C ports and distributing FFN weights across racks. That said, Dynamo and NVIDIA’s networking stack are designed to hide much of this complexity from end users.
Rubin CPX cancellation and Vera Rubin integration
The Groq acquisition also solved a product-planning conflict. Initially, the Vera Rubin roadmap included a second GPU variant called Rubin CPX, a GDDR7-based chip optimized for long-context and KV cache workloads. At GTC 2026, an attendee asked Ross directly whether Rubin CPX had been canceled.
Ross sidestepped with “one question per person” and did not give a direct yes or no. However, in a separate Ian Buck press Q&A, NVIDIA confirmed that CPX has been shelved and that LPU-based decode will ship in its place this year. The LPU solves the same context-independent, memory-capacity inference problem using dedicated SRAM rather than a GPU variant.
As a result, Vera Rubin integration now standardizes on Rubin GPUs for attention and KV cache tasks, while LPX racks assume responsibility for FFN execution and MoE dispatch. Moreover, this simplifies the product stack: instead of proliferating GPU SKUs, NVIDIA leans on a clear GPU–LPU division of labor.
Frequently asked questions on LPU and Rubin
Why not just deploy more Rubin GPUs?
Adding more Vera Rubin GPUs increases aggregate batch throughput — more parallel requests can be served simultaneously. However, it does not improve decode latency for a single request, because token generation remains inherently sequential regardless of GPU count. The LPU is designed to reduce per-request latency, not just system-wide throughput, making the two approaches complementary.
How is the 35x throughput improvement measured?
The 35x figure is a system-level metric. It compares a full Vera Rubin NVL72 plus Groq 3 LPX deployment against a Grace Blackwell NVL72-only system, measured in tokens per second per megawatt. It is not a per-GPU improvement, and the baseline explicitly excludes smaller configurations like B200x8 or H100x8.
Where is the GPU-to-LPU bottleneck?
Rack-to-rack latency is a real consideration. A Microsoft engineer raised this point directly during the GTC session. Ross acknowledged the concern and said networking improvements are in progress. Meanwhile, Dynamo handles KV-aware routing to minimize unnecessary transfers, and NVIDIA’s Mellanox networking stack manages inter-rack links to keep latency as low as possible.
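The concern is easy to quantify. With 40 decoder layers, each token incurs 40 GPU-to-LPU round trips, so even small one-way link latencies multiply. The article gives no latency figures, so the values below are purely illustrative; they show how quickly the overhead eats into a 1.0–1.25 ms/token budget implied by the 800–1,000 tokens-per-second target:

```python
# Per-token overhead from GPU<->LPU round trips under assumed one-way
# latencies. The latency values are illustrative, not from the source.
LAYERS = 40  # decoder layers -> round trips per generated token

for one_way_us in (1, 2, 5):
    per_token_ms = LAYERS * 2 * one_way_us / 1000  # out and back, per layer
    print(f"{one_way_us} us one-way -> {per_token_ms:.2f} ms/token of link overhead")
```

At 2 µs one-way, link traversal alone consumes 0.16 ms of a roughly 1 ms token budget, which is why the routing and networking layers matter as much as the silicon.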
Can the LPU train models?
No. The LPU is strictly decode-only. Its compiler-driven static scheduling, which makes FFN execution so fast and deterministic, is poorly suited to the dynamic and iterative nature of backpropagation. Training workloads therefore remain on Rubin GPUs, which are architected for flexible, high-throughput compute.
When will Groq 3 LPX be available?
Ross confirmed at GTC that Groq 3 LPX is already in production. Jensen Huang announced Q3 2026 as the timeline for customer availability, aligned with the broader Vera Rubin platform rollout. Ross described the journey from Christmas Day discussions to the GTC stage as “probably one of the fastest semiconductor ramps in history,” spanning roughly three months.
Strategic implications for AI inference
The Groq LPU architecture fills a gap no GPU variant could close, not because GPUs lack transistors or memory, but because sequential, latency-sensitive FFN decode misaligns with their parallel design. As user volumes grow, more concurrent requests hit each datacenter, and every millisecond of per-token latency compounds across thousands of sessions.
The acquisition signaled NVIDIA’s recognition that GPUs alone are not sufficient for the decode side of LLM inference. Moreover, it confirmed a broader industry shift: specialized accelerators are emerging to complement general-purpose GPUs, not to supplant them entirely.
Compiler-driven SRAM at 150 TB/s is the right tool for FFN execution. HBM4 at 22 TB/s is the right tool for attention over long contexts and KV-heavy workloads. The $20 billion bet is that using both together — orchestrated by Dynamo and connected via NVLink and high-speed C2C links — creates a system that neither could match alone.
Based on NVIDIA’s GTC 2026 announcements and official documentation, Groq 3 LPX is targeting Q3 2026 availability. If the ramp stays on schedule, the combined Rubin–LPU stack could define the reference architecture for large-scale AI inference in the second half of the decade.

