Groq and SambaNova — Chips That Went All In on Inference
- Author: Youngju Kim (@fjvbn20031)
- Introduction
- 0. Terms First — LPU, RDU, and ASIC
- 1. Why Inference-Only Chips
- 2. Groq LPU — The Idea of Deterministic Execution
- 3. SambaNova RDU — Reconfiguring the Dataflow
- 4. Comparing Against GPUs — Latency, Throughput, Cost
- 4.5. Same Inference, Different Chips — One Picture
- 5. Software and the Compiler
- 6. Which Workloads Fit, and the Limits
- 6.5. Why Tokens Per Second Matters — An Intuitive Calculation
- 7. Market Positioning — The 2026 Picture
- 8. The Developer's View — What to Judge By
- 8.5. The Larger Trend — Why Inference Chips Now
- 9. The Two Phases of LLM Inference — Prefill and Decode
- 10. What Deterministic Execution Means for Operations
- 11. Dataflow vs von Neumann — A Deeper Contrast
- 12. Quantization and Precision — A Shared Weapon of Inference Chips
- 13. Summarizing the Two Companies' Strategic Differences
- 14. Frequently Asked Questions
- 15. The Core at a Glance
- Closing
- References
Introduction
For the past decade, the AI hardware story was essentially a story about training GPUs. The race to train bigger models faster pulled the market along. In 2026, though, the center of gravity is shifting. For many applications models are already capable enough; now the question is "how cheaply and quickly can you serve them."
In numbers, 2026 is described as the inflection point where inference capex first overtakes training capex. That is the backdrop for cloud providers pouring out their own inference ASICs and the rapid rise in market share of inference-only chips.
If training is "a one-time investment to build a model," inference is "the operating cost incurred every day while you use that model." The more widely a model is used, the more inference cost accumulates, so even a small gain in inference efficiency translates into large cost savings at scale. These economics underpin the rise of inference-only chips.
The protagonists of this article are two companies at the vanguard of that shift. Groq wields the LPU (Language Processing Unit) for "deterministic, extremely low latency," and SambaNova wields the RDU (Reconfigurable Dataflow Unit) for "dataflow reconfiguration." Both carry a design philosophy fundamentally different from the GPU.
This article aims to unpack the working principles of these two chips as intuitively as possible and to weigh in a balanced way where they win and lose against the GPU. If you follow the logic of the design rather than the marketing copy, it becomes clear why such chips are appearing now.
0. Terms First — LPU, RDU, and ASIC
Before the main text, sorting out the acronyms that recur makes reading much smoother.
| Acronym | Expansion | One-line description |
|---|---|---|
| LPU | Language Processing Unit | Groq's inference-specialized chip; deterministic execution |
| RDU | Reconfigurable Dataflow Unit | SambaNova's chip; reconfigures dataflow |
| ASIC | Application-Specific IC | a custom chip tailored to a specific use |
| HBM | High Bandwidth Memory | stacked DRAM packaged beside the compute die; high bandwidth, but off-chip |
| SRAM | Static RAM | fast on-chip memory |
| decode | (inference phase) | the phase generating tokens one at a time |
The last row, decode, is a phase of inference rather than a chip, but it is so important for understanding inference chips that we list it here. We cover it again in detail later.
Both the LPU and the RDU are, in a broad sense, kinds of inference-specialized ASICs. If the GPU is "a general-purpose chip that handles anything," these are "chips narrowed to do inference, especially LLM serving, well." Narrowing in exchange for doing better in that area is the basic bargain of a specialized chip.
Understanding the profit and loss of this bargain is the goal of this article: what you give up (generality, ecosystem) and what you gain (latency, efficiency). Knowing that balance lets you judge for yourself whether "this chip fits my workload."
1. Why Inference-Only Chips
Training and inference look similar but have different workload characteristics.
| Characteristic | Training | Inference (serving) |
|---|---|---|
| Batch size | can be made large | small or 1 (real-time) |
| Key metric | throughput, cost per hour | latency, tokens per second |
| Data reuse | high | low (one weight read per token) |
| Precision | higher (FP16/BF16 typical) | often aggressively quantized (INT8, FP8) |
Real-time LLM serving in particular has small batches and reads the model's weights once per token, so it is memory-bandwidth bound. GPUs are designed to maximize throughput for training, so in this "small batch, low latency" regime they often cannot use their full potential. Inference-only chips aim exactly at this gap.
By analogy, training is like a freight truck moving a lot of cargo at once, while real-time inference is like a delivery motorbike rushing a single package fast. The freight truck (GPU) is overwhelming on throughput, but a motorbike (inference chip) may be better at delivering one package fastest. The two are less in competition than good at different jobs.
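The "small batch means memory-bandwidth bound" claim can be checked with a back-of-the-envelope roofline calculation. The sketch below is a simplification: it assumes a dense layer where each weight is read once per decode step and used for one multiply-add per sequence in the batch, and the chip numbers (`PEAK_FLOPS`, `PEAK_BW`) are hypothetical rather than taken from any datasheet.

```python
def arithmetic_intensity(batch_size: int, bytes_per_weight: float) -> float:
    """FLOPs performed per byte of weights read in one decode step.

    Each weight is read once and used for one multiply-add (2 FLOPs)
    per sequence in the batch.
    """
    return 2 * batch_size / bytes_per_weight


def is_bandwidth_bound(batch_size: int, bytes_per_weight: float,
                       peak_flops: float, peak_bytes_per_s: float) -> bool:
    """True when memory bandwidth, not compute, limits the step."""
    ridge = peak_flops / peak_bytes_per_s  # FLOPs the chip can do per byte moved
    return arithmetic_intensity(batch_size, bytes_per_weight) < ridge


# Hypothetical accelerator: 1e15 FLOP/s of compute, 3e12 bytes/s of memory bandwidth.
PEAK_FLOPS = 1e15
PEAK_BW = 3e12

for batch in (1, 8, 64, 512):
    bound = is_bandwidth_bound(batch, 2.0, PEAK_FLOPS, PEAK_BW)
    label = "bandwidth-bound" if bound else "compute-bound"
    print(f"batch {batch:>3}: {arithmetic_intensity(batch, 2.0):>6.1f} FLOPs/byte, {label}")
```

Under these assumed numbers, only very large batches reach the point where compute becomes the limit, which is why real-time serving with small batches leaves much of a throughput-oriented chip idle.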
2. Groq LPU — The Idea of Deterministic Execution
No cache, no speculation
A typical processor uses plenty of "dynamic" techniques for performance: caches, branch prediction, out-of-order execution. These raise average performance but make execution time jittery depending on input and state. The same operation is fast on a cache hit and slow on a miss.
The Groq LPU strips these dynamic elements away. Execution is deterministic. Exactly which operation runs at which cycle and where is fixed entirely at compile time. The hardware has no "luck-dependent variation" like cache misses or misprediction.
| Typical processor | Groq LPU |
|---|---|
| scheduling at runtime | scheduling at compile time |
| timing varies by cache hit/miss | timing fixed at cycle granularity |
| hardware decides ordering | compiler decides ordering |
| latency unpredictable | latency predictable |
The compiler decides everything
This determinism places enormous responsibility on the compiler. Which data must arrive at which compute unit when, and when to read from memory, is laid out by the compiler at cycle granularity ahead of time. The hardware merely executes that plan.
The advantage is clear: latency is predictable and very low. Token generation rate is consistently fast, and tail latency does not spike. Keeping weights in on-chip SRAM, whose access time is fixed, rather than relying on external HBM with its variable access latency, supports this.
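```python
# Conceptual flow only: `groq_compiler`, `load_transformer`, and `plan.generate`
# are illustrative names, not a real API.
import groq_compiler as gc

model = gc.load_transformer("my-llm")

# The compiler schedules every operation at cycle granularity,
# so the resulting binary runs deterministically.
plan = gc.compile(model, target="lpu", seq_len=2048)

# Serving: per-request latency stays consistently low.
requests = ["Explain deterministic execution in one sentence."]
for prompt in requests:
    tokens = plan.generate(prompt, max_tokens=256)
```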
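To make "the compiler decides everything" more concrete, here is a toy static scheduler. It is only a sketch of the general idea of compile-time scheduling, not Groq's actual compiler: the `Op` type, the unit labels (`mem`, `mxm`, `vxm`), and the cycle counts are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str          # functional unit that runs the op (illustrative labels)
    cycles: int        # fixed latency, known at compile time
    deps: tuple = ()   # names of ops whose results this op consumes


def compile_schedule(ops):
    """Assign every op a start cycle ahead of time.

    Ops must be listed in dependency order. An op starts once all of its
    inputs are finished and its functional unit is free.
    """
    finish = {}      # op name -> cycle at which its result is ready
    unit_free = {}   # unit name -> first cycle at which the unit is idle
    schedule = []
    for op in ops:
        ready = max((finish[d] for d in op.deps), default=0)
        start = max(ready, unit_free.get(op.unit, 0))
        finish[op.name] = start + op.cycles
        unit_free[op.unit] = finish[op.name]
        schedule.append((start, op.unit, op.name))
    return schedule, max(finish.values())


layer = [
    Op("load_w", "mem", 4),
    Op("load_x", "mem", 2),
    Op("matmul", "mxm", 8, deps=("load_w", "load_x")),
    Op("bias_add", "vxm", 1, deps=("matmul",)),
    Op("gelu", "vxm", 2, deps=("bias_add",)),
]

schedule, total = compile_schedule(layer)
for start, unit, name in schedule:
    print(f"cycle {start:>2}  {unit:<3}  {name}")
print("total cycles:", total)
```

Because every latency is a fixed input and nothing depends on runtime state, compiling the same graph twice yields the same cycle-by-cycle plan, which is the property the text calls deterministic execution.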
Trade-off
In exchange, a single LPU's on-chip memory is small. Serving a large model requires ganging many LPUs to spread the model out, and this system configuration adds cost and complexity. The deterministic design wins on latency but is disadvantaged at "fitting a large model on one chip."
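The cost of small on-chip memory can be estimated with simple arithmetic. The sketch below counts only weight storage and ignores the KV cache and activations, so it gives a lower bound; the model size and per-chip SRAM capacity are hypothetical inputs, not vendor figures.

```python
import math


def chips_needed(n_params: float, bytes_per_weight: float,
                 sram_bytes_per_chip: float, usable_fraction: float = 1.0) -> int:
    """Minimum number of chips required to hold all weights in on-chip SRAM."""
    weight_bytes = n_params * bytes_per_weight
    usable_bytes = sram_bytes_per_chip * usable_fraction
    return math.ceil(weight_bytes / usable_bytes)


# Hypothetical: 70 billion parameters at 1 byte each, 200 MB of SRAM per chip.
all_sram = chips_needed(70e9, 1.0, 200e6)
# Same model if only 80% of SRAM is available for weights.
with_headroom = chips_needed(70e9, 1.0, 200e6, usable_fraction=0.8)

print("chips (all SRAM for weights):", all_sram)
print("chips (80% of SRAM for weights):", with_headroom)
```

Halving `bytes_per_weight` through quantization halves the chip count in this model, which is one reason precision matters so much for SRAM-centric designs.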
3. SambaNova RDU — Reconfiguring the Dataflow
Instead of streaming instructions, change the circuit
GPUs and CPUs are von Neumann: "fetch an instruction from memory and execute." The SambaNova RDU takes a different path: a reconfigurable dataflow approach that places the operation graph itself "spatially" on the chip.
Put simply, the model's layers are laid out on the chip like a pipeline, and data flows through that pipeline to produce results. The overhead of fetching instructions repeatedly shrinks, and one layer's output can flow directly to the next without going through memory.
| von Neumann (GPU/CPU) | Dataflow (RDU) |
|---|---|
| fetch -> execute loop | place op graph on chip |
| store intermediates in memory | results flow between units |
| general, flexible | specialized to the graph, efficient |
Reconfigurability
The heart of "reconfigurable" is that the same chip can be re-laid-out for a different model or operation graph. It does not swap the whole circuit like an FPGA, but it changes the dataflow configuration at a coarse-grained level to adapt to varied models. This keeps some flexibility, not as much as a GPU but meaningful.
SambaNova has also emphasized handling large models with a hierarchical memory (on-chip plus large external), serving models of hundreds of billions to a trillion parameters with a small number of systems. Combining dataflow placement with the memory hierarchy lets it serve models without slicing them finely.
4. Comparing Against GPUs — Latency, Throughput, Cost
Roughly comparing the three chips:
| Item | GPU | Groq LPU | SambaNova RDU |
|---|---|---|---|
| Design philosophy | general throughput | deterministic low latency | dataflow |
| Strength | both training and inference | real-time tokens/s | efficient large-model serving |
| Single-chip memory | large HBM | small on-chip SRAM | hierarchical |
| Flexibility | best | inference-specialized | medium |
| Ecosystem | CUDA, dominant | growing | growing |
One key intuition: the GPU does "everything reasonably well," and inference-only chips do "better in a specific area." For real-time single-request token generation, Groq shows impressive tokens per second, while SambaNova claims strength in serving models efficiently with few systems. Conversely, if you need varied workloads, rapidly changing model architectures, and rich libraries, the GPU is still the safe bet.
4.5. Same Inference, Different Chips — One Picture
Distilling the comparison so far into one picture shows how the same LLM inference request is handled on each chip.
| Chip | How it handles the same inference request |
|---|---|
| GPU | read weights from HBM, maximize throughput with large batch |
| LPU (Groq) | weights in on-chip SRAM, deterministically fast tokens |
| RDU (SambaNova) | spread the graph on the chip, efficient large-model serving |
The key is that "there is no single right answer." Even for the same request, the right chip depends on what you optimize: the GPU for throughput and flexibility, the LPU for consistent low latency on a single request, the RDU for running a large model efficiently with few systems. Choosing hardware means deciding "what do I want to optimize."
5. Software and the Compiler
The fate of an inference-only chip rests on the compiler. No matter how fast the hardware is, it is of little use if the compiler cannot map a developer's standard model onto the chip efficiently.
- Groq: with deterministic execution, the compiler must build a cycle-level schedule. When compilation goes well, latency is fantastic, but supporting a new operator or model structure requires compiler work.
- SambaNova: the compiler that places the operation graph as dataflow is central. It converts a graph received from PyTorch and others into an RDU configuration.
Both companies aim to accept a standard frontend like PyTorch and handle the conversion to their chip behind it. The important question for a developer is "is the model or operator I use a first-class citizen in this chip's compiler." If supported, it is smooth; if not, it is painful.
[ PyTorch model definition ]
|
[ vendor compiler ] <- converts/schedules for the chip here
|
[ chip execution binary ]
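The "first-class citizen" question can be turned into a quick pre-check before committing to a vendor. This is a minimal sketch: the operator names and the `SUPPORTED_OPS` set are hypothetical, and in practice the list of operators would come from your framework's graph export while the support list would come from the vendor's documentation.

```python
from collections import Counter

# Hypothetical operator names; a real vendor would publish its own support list.
SUPPORTED_OPS = {"linear", "layer_norm", "softmax", "gelu", "matmul", "add"}


def unsupported_ops(model_ops, supported_ops):
    """Count the operators a model uses that the compiler cannot lower."""
    counts = Counter(model_ops)
    return {op: n for op, n in counts.items() if op not in supported_ops}


# Operators collected from a model graph (illustrative list).
model_ops = ["layer_norm", "linear", "rotary_embedding", "matmul",
             "softmax", "matmul", "linear", "add", "rotary_embedding"]

missing = unsupported_ops(model_ops, SUPPORTED_OPS)
if missing:
    print("needs compiler work:", missing)
else:
    print("all operators are supported")
```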
6. Which Workloads Fit, and the Limits
Good fits
- Low-latency LLM serving: conversational chatbots, voice assistants, models with long reasoning chains
- Production services needing consistent tail latency (especially Groq's deterministic execution)
- Environments that serve stably at scale without changing the model often
Limits
- Memory and model size: small single-chip memory forces spreading large models across many chips, raising system cost and complexity.
- Flexibility: in an era of fast-evolving model architectures, if a specialized chip's compiler cannot keep up with the latest operators, adoption lags.
- Ecosystem: CUDA's moat of libraries, community, and talent pool is still large.
6.5. Why Tokens Per Second Matters — An Intuitive Calculation
Let us conceptually follow what the inference chip's key metric, "tokens per second," actually means.
When an LLM generates an answer, the decode phase produces tokens one at a time. Making each token requires reading the model weights once. So token generation rate is roughly proportional to "how fast you can read all the weights."
token generation rate ~ weight read speed
-----------------------------------
weights in HBM -> HBM bandwidth is the ceiling
weights in on-chip SRAM -> on-chip bandwidth is the ceiling
on-chip bandwidth far higher than HBM -> tokens/s greatly improved
Intuitively: the time to read a model's weights once sets the minimum generation time for one token. Reading weights from on-chip SRAM, whose bandwidth is typically an order of magnitude or more above HBM's, lets you produce tokens much faster for the same model. This is the core principle by which inference-only chips outrun the GPU on single-request tokens per second.
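The proportionality between weight read speed and token rate can be written as a one-line ceiling. The sketch below assumes a single request with every weight read exactly once per token; the model size and both bandwidth figures are hypothetical, chosen only to show how the ceiling moves.

```python
def max_tokens_per_second(n_params: float, bytes_per_weight: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-request decode rate when weight reads dominate."""
    weight_bytes = n_params * bytes_per_weight
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical 8-billion-parameter model stored at 2 bytes per weight.
N_PARAMS = 8e9
BYTES_PER_WEIGHT = 2

hbm_ceiling = max_tokens_per_second(N_PARAMS, BYTES_PER_WEIGHT, 3e12)    # assumed HBM bandwidth
sram_ceiling = max_tokens_per_second(N_PARAMS, BYTES_PER_WEIGHT, 60e12)  # assumed aggregate SRAM bandwidth

print(f"HBM ceiling:  {hbm_ceiling:.1f} tokens/s")
print(f"SRAM ceiling: {sram_ceiling:.1f} tokens/s")
```

These are ceilings, not predictions: real systems lose some of this to attention over the KV cache, inter-chip communication, and scheduling, but the ratio between the two ceilings follows the bandwidth ratio directly.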
Why does single request matter? When a user chats with a chatbot, that user feels the speed of their one request. Batching 100 requests to raise throughput is good for server efficiency but does not speed up that one user's perceived rate. In real-time conversation, single-request latency is the user experience.
7. Market Positioning — The 2026 Picture
In the 2026 accelerator market, NVIDIA still holds about 75 to 80 percent and dominates both training and inference with the Blackwell generation. The next-gen Vera Rubin is discussed for late in the year targeting higher perf/watt, and Google TPU (Trillium, inference-specialized Ironwood) and AMD MI350X form the competition.
Under this enormous shadow, the spot Groq and SambaNova target is not "everything" but a single point: "inference latency." The rapid growth in inference-ASIC market share, projected from about 15 percent in 2024 to about 40 percent in 2026, is a favorable wind for these specialized chips. In an era where inference capex overtakes training capex, "cheap, fast serving" is a money-making capability.
8. The Developer's View — What to Judge By
If you are evaluating a specialized chip, check the following.
- Is my model supported: confirm the architecture and operators you use are first-class in the vendor compiler.
- Is the real bottleneck latency: if your workload is batch-throughput centric, a GPU may be better. If single-request latency is key, specialized chips shine.
- Total cost of ownership (TCO): look not just at chip price but at the number of systems needed to spread a large model, power, and operational complexity.
- Lock-in risk: assess vendor dependence and ecosystem maturity. Confirm there is a migration path.
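The TCO item in the checklist above can be roughed out as cost per million tokens. This is a deliberately crude sketch that counts only hardware purchase and electricity; every input (prices, power draw, throughput, utilization) is a made-up placeholder to be replaced with your own measurements and quotes.

```python
def cost_per_million_tokens(system_price_usd, n_systems, lifetime_years,
                            power_kw_per_system, usd_per_kwh,
                            tokens_per_second, utilization):
    """Hardware plus energy cost per million generated tokens."""
    hours = lifetime_years * 365 * 24
    capex = system_price_usd * n_systems
    energy = power_kw_per_system * n_systems * hours * usd_per_kwh
    tokens = tokens_per_second * utilization * hours * 3600
    return (capex + energy) / tokens * 1e6


# Two made-up deployments serving the same model over four years at 50% utilization.
few_big = cost_per_million_tokens(300_000, 2, 4, 10.0, 0.10, 4_000, 0.5)
many_small = cost_per_million_tokens(50_000, 16, 4, 3.0, 0.10, 6_000, 0.5)

print(f"few large systems:  ${few_big:.2f} per million tokens")
print(f"many small systems: ${many_small:.2f} per million tokens")
```

The point of the exercise is that the number of systems multiplies both purchase and power cost, so a cheaper chip does not automatically mean cheaper tokens.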
8.5. The Larger Trend — Why Inference Chips Now
Stepping back from the rise of inference-only chips, you can see several structural forces acting at once.
- Model maturity: as the race to train giant models settles somewhat, the center of value shifts from "building better models" to "using existing models well." Inference is the business.
- Power ceiling: as data-center power becomes the real ceiling, the efficiency of doing the same work at less power becomes a direct competitive edge.
- Cost pressure: as the cost of serving models takes up a large share of operating cost, hardware that lowers cost per inference flows straight to margin.
- Workload differentiation: as models that unfold long reasoning chains proliferate, low latency in the decode phase becomes more important.
Together these forces drive the rapid growth projected from about 15 percent inference-ASIC market share in 2024 to about 40 percent in 2026. NVIDIA still dominates at about 75 to 80 percent, but cloud providers' in-house inference ASICs and specialized chips like Groq and SambaNova are quickly filling the gap.
2024 -> 2026 inference-ASIC share trend
-----------------------------------
2024: about 15% (mostly GPU)
2026: about 40% (in-house ASIC + specialized chips surge)
backdrop: inference capex first overtakes training capex
The key is that this trend is a structural shift, not a passing fad. As long as models mature, power is the ceiling, and inference is the business, the place for chips wielding inference efficiency keeps widening.
9. The Two Phases of LLM Inference — Prefill and Decode
To properly understand inference-only chips, you need to know that LLM inference splits into two phases.
- Prefill phase: processes the entire input prompt at once to prepare for producing the first token. With many tokens, parallelism is high and it tends to be compute bound.
- Decode phase: generates tokens one at a time, sequentially. Each token requires reading the model weights once, so it is memory-bandwidth bound.
| Prefill (parallel, compute-bound) | Decode (sequential, bandwidth-bound) |
|---|---|
| whole prompt at once | generate one token at a time |
| large compute volume | read weights once per token |
| throughput matters | latency matters |
Inference-only chips shine mainly in the decode phase. The "typing speed" users feel is decode speed, and that is memory-bandwidth bound. Groq accelerating weight reads with on-chip SRAM and fixing per-token time with deterministic execution is a design aimed precisely at consistent low latency in this decode phase. Knowing this distinction makes clear why "tokens per second" is the key metric for inference chips.
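A rough estimate shows how one request's time splits between the two phases. The sketch below uses the common approximation of about 2 FLOPs per parameter per token for prefill and one full weight read per token for decode; it ignores attention cost and the KV cache, and the model and chip numbers are hypothetical.

```python
def prefill_seconds(n_params, prompt_tokens, peak_flops):
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    return 2 * n_params * prompt_tokens / peak_flops


def decode_seconds(n_params, bytes_per_weight, bandwidth, output_tokens):
    """Bandwidth-bound estimate: one full weight read per generated token."""
    return output_tokens * n_params * bytes_per_weight / bandwidth


# Hypothetical model and chip numbers, chosen only to show the shape of the result.
N_PARAMS = 8e9
prefill = prefill_seconds(N_PARAMS, prompt_tokens=1000, peak_flops=1e15)
decode = decode_seconds(N_PARAMS, bytes_per_weight=2, bandwidth=3e12, output_tokens=256)

print(f"prefill: {prefill * 1000:.1f} ms for 1000 prompt tokens")
print(f"decode:  {decode * 1000:.1f} ms for 256 output tokens")
print(f"decode share of total: {decode / (prefill + decode):.0%}")
```

With these assumed inputs, decode takes most of the wall-clock time even though prefill processes about four times as many tokens, which matches the intuition that the user-visible speed is set by the decode phase.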
10. What Deterministic Execution Means for Operations
Groq's deterministic execution goes beyond "fast" to give real value in production operations.
The hardest thing to manage in a production service is not average latency but tail latency. Even if 999 of 1000 requests are fast, one spike gives that user a bad experience. An SLA (service-level agreement) is usually defined by a tail metric such as p99 or p99.9, not the average.
| High-variance system | Deterministic system |
|---|---|
| average is fast | average is fast too |
| latency occasionally spikes (bad p99) | tail is stable too (good p99) |
| swayed by luck like cache misses | fixed at cycle granularity |
| capacity planning is hard | capacity planning is easy |
A deterministic system with no cache misses or scheduling variance has stable tail latency. This leads to two operational advantages. First, SLAs are easier to meet. Second, capacity planning is easier: because the on-chip execution time of each request is known ahead of time, you can calculate far more precisely how many systems you need. Queueing and network delays still vary, so this predictability covers the compute portion of latency. High-variance systems tend to over-provision for the worst case, and a deterministic system reduces that waste.
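The gap between average and tail latency is easy to see with synthetic numbers. The sketch below compares a fixed-latency system with one that is faster on average but spikes occasionally; the latencies and the 2 percent spike rate are invented for illustration, and the random generator is seeded so the run is repeatable.

```python
import random
import statistics


def p99(samples):
    """99th percentile: the last of the 99 cut points for n=100."""
    return statistics.quantiles(samples, n=100)[-1]


rng = random.Random(0)

# Deterministic system: every request takes exactly 10 ms.
deterministic = [10.0] * 10_000
# High-variance system: 8 ms normally, plus a 40 ms stall on about 2% of requests.
jittery = [8.0 + (40.0 if rng.random() < 0.02 else 0.0) for _ in range(10_000)]

for label, samples in (("deterministic", deterministic), ("jittery", jittery)):
    mean = statistics.mean(samples)
    tail = p99(samples)
    print(f"{label:<13} mean={mean:5.2f} ms  p99={tail:5.2f} ms  p99/mean={tail / mean:4.2f}")
```

The jittery system wins on the mean and loses badly on p99. The p99/mean ratio is roughly the headroom a capacity planner has to budget for, and it stays at 1 for the deterministic case.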
11. Dataflow vs von Neumann — A Deeper Contrast
To understand SambaNova's dataflow approach more deeply, let us place the two paradigms side by side.
A von Neumann machine (CPU, GPU) operates around an "instruction stream." It repeats a cycle of fetching, decoding, and executing instructions from memory. This is flexible, but it carries the overhead of fetching instructions and the cost of storing intermediate results in memory.
A dataflow machine (RDU) operates around "data dependencies." It lays the operation graph out on the chip, and when data is ready the corresponding operation executes automatically. One operation's output flows directly as the next operation's input, reducing the round trip of storing intermediates in memory and reading them back.
| von Neumann | Dataflow |
|---|---|
| instructions at the center | data at the center |
| fetch-decode-execute loop | graph spread in space |
| intermediate-result memory trips | results flow directly between units |
| flexible but overhead | specialized to the graph, efficient |
Neural network inference is essentially repeatedly executing a fixed operation graph (a chain of layers). If the graph is fixed, laying it out on the chip with dataflow is naturally more efficient than fetching instructions every time with von Neumann. The RDU's "reconfigurable" means this layout can be redone for a different graph, so it adapts to varied models while enjoying dataflow efficiency.
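The dataflow firing rule ("when data is ready the corresponding operation executes") can be mimicked in a few lines of software. This is a conceptual sketch of the execution model only, not how an RDU is programmed: the graph format and node names are assumptions, and real hardware places the nodes spatially rather than looping over them.

```python
def run_dataflow(graph, inputs):
    """Execute a graph by firing each node once all of its inputs hold a value.

    graph maps a node name to (function, list of input names).
    Returns the final values and the order in which nodes fired.
    """
    values = dict(inputs)
    pending = dict(graph)
    fired = []
    while pending:
        ready = [name for name, (_, deps) in pending.items()
                 if all(d in values for d in deps)]
        if not ready:
            raise ValueError("graph has a cycle or a missing input")
        for name in ready:
            fn, deps = pending.pop(name)
            values[name] = fn(*(values[d] for d in deps))
            fired.append(name)
    return values, fired


# Nodes are listed in reverse on purpose: firing order comes from data, not listing order.
graph = {
    "relu": (lambda v: max(v, 0.0), ["shift"]),
    "shift": (lambda s, b: s + b, ["scale", "b"]),
    "scale": (lambda x, w: x * w, ["x", "w"]),
}

values, fired = run_dataflow(graph, {"x": 2.0, "w": 3.0, "b": -10.0})
print("firing order:", fired)
print("relu output:", values["relu"])
```

There is no program counter here: the order of execution falls out of the data dependencies, and each result is handed straight to its consumer.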
12. Quantization and Precision — A Shared Weapon of Inference Chips
A technique inference-only chips commonly rely on is quantization. Training is usually done at high precision (FP16, BF16, etc.), but inference is often accurate enough at low precision (INT8, FP8, even lower bits).
Low precision gives inference chips three benefits.
- Memory savings: storing weights in fewer bits fits a larger model in the same SRAM.
- Bandwidth savings: fewer weight bits read per token speeds up decode.
- Compute efficiency: low-bit compute units are smaller and faster, fitting more in the same area.
| Precision | Memory/bandwidth | Accuracy risk |
|---|---|---|
| FP16/BF16 | baseline | safe |
| INT8/FP8 | about half | mostly safe |
| lower bits | more savings | risky per model/layer |
Of course lowering precision risks accuracy loss, so finding how far you can go is key. Inference chips and their compilers are usually designed to support such quantization paths well. For a developer, the practical question is "down to what precision does my model retain accuracy on this chip."
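To see what the memory and accuracy trade-off looks like in practice, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is one simple scheme among many and not necessarily what either vendor's toolchain uses; the weight values are arbitrary examples.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights from INT8 codes."""
    return [v * scale for v in q]


weights = [0.023, -0.517, 0.314, 1.27, -1.004, 0.0]  # arbitrary example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 codes:", q)
print(f"scale: {scale:.4f}, max abs error: {max_error:.4f}")
print("bytes at FP16:", 2 * len(weights), " bytes at INT8:", len(weights))
```

The rounding error per weight is bounded by half the scale, so a tensor with a few large outliers gets a coarse scale and loses more precision. That is why the safe bit width has to be checked per model and sometimes per layer.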
13. Summarizing the Two Companies' Strategic Differences
Groq and SambaNova target the same "inference" market but emphasize different things.
| Aspect | Groq | SambaNova |
|---|---|---|
| Core weapon | deterministic execution, ultra-low latency | dataflow, large-model efficiency |
| Memory philosophy | on-chip SRAM centric | on-chip + large external tier |
| Primary scenario | consistent low-latency token generation | serving large models with few systems |
| Emphasized metric | tokens/s, p99 latency | systems per model, efficiency |
Roughly, Groq is closer to "one request fastest and most consistently," and SambaNova is closer to "a large model most efficiently." Which fits, again, depends on the workload. For ultra-low-latency conversational services Groq's strength stands out, and for serving hundreds-of-billions-parameter models with little infrastructure SambaNova's strength stands out.
14. Frequently Asked Questions
Q. Do inference-only chips replace the GPU? No. The GPU remains the center of training and varied workloads. Inference chips are complements targeting the specific area of low-latency serving.
Q. Why is deterministic execution so important? Because stable tail latency makes SLAs easier to meet and capacity planning easier. In production operations, predictability is a big value.
Q. Can I deploy my model right away? If the compiler supports your operators and model structure, deployment is relatively smooth. If the model uses recent operators that are not yet supported, extra work may be required.
Q. Which metrics should I watch? Single-request tokens per second, p99 latency, and the number of systems and power needed to spread a large model. Watching only average throughput can miss inference chips' strength.
15. The Core at a Glance
- 2026 is the inflection point where inference capex first overtakes training capex, an era favorable to inference-only chips.
- The Groq LPU offers predictable, low latency via deterministic execution, with the compiler scheduling at cycle granularity.
- The SambaNova RDU spreads dataflow on the chip to serve large models efficiently with few systems.
- For both chips the compiler is decisive, and whether your model and operators are supported is the key to adoption.
- Rather than replacing the GPU, they settle in as complements targeting the specific area of low-latency inference.
Closing
Groq and SambaNova do not try to beat the GPU head-on. Instead they dig deep into a single point where the GPU is structurally disadvantaged: "inference, especially low-latency serving." Groq wields predictable, low latency through deterministic execution; SambaNova wields efficient large-model serving through reconfigurable dataflow.
Their success rests on two things. One is how smoothly the compiler accepts developers' models; the other is how fast and deep the shift toward an inference-centric era proceeds. What is clear is that the era of "one chip does everything well" is fading, and an era of choosing the optimal hardware per workload is arriving. Having more choices is good news for developers.
Finally, one thing to stress. A good hardware choice does not begin by comparing benchmark numbers. It begins by first measuring "what is the real bottleneck of my workload." If you can answer that question, the tool, whether Groq, SambaNova, or the GPU, follows naturally.