Skip to content
Published on

The KV Cache and PagedAttention — Everything About Inference Memory

Authors

Introduction

In LLM inference, GPU memory is shared between two things. One is the model weights, and the other is the KV cache. Weights are fixed once the model size is set, but the KV cache grows endlessly with the number of concurrent requests and the sequence length. That is why, in practice, what determines "how many users we can serve at once" and "how long a context we can offer" is mostly the KV cache.

This article digs all the way into the single topic of the KV cache. First we confirm, with arithmetic, what the KV cache is, why it is needed, and why it consumes so much memory. Then we look at the fragmentation problem of traditional memory management, how PagedAttention solves it, and how much further we can shrink things with prefix sharing and KV quantization.

Core Principle: What Is the KV Cache

The transformer's attention produces three vectors for each token: query, key, and value. When generating a token, that token's query computes attention against the key and value of all preceding tokens.

attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

The key insight is that each time you generate a new token, the keys and values of the preceding tokens do not change. When generating the 100th token, the K and V of tokens 1 through 99 are exactly the same as those computed in the previous step. So there is no reason to recompute them every time.

The KV cache stores precisely these K and V values to reuse them. Without a cache, every time you produce a single token you would have to recompute the K and V for all tokens so far. The longer the sequence, the more unmanageable this redundant computation becomes. The KV cache eliminates this redundancy and is the essential mechanism that makes decode run at a practical speed.

Why the KV Cache Eats Memory

The problem is that the KV cache accumulates per token, per layer, and per attention head. Its size, expressed as a shape, looks like this.

KV cache shape:
num_layers x 2 x batch x num_kv_heads x seq_len x head_dim

Meaning of each term:
  num_layers   : number of transformer layers (cache per layer)
  2            : key and value, two of them
  batch        : number of concurrent requests
  num_kv_heads : number of KV heads (fewer than query heads with GQA/MQA)
  seq_len      : number of tokens so far (keeps growing)
  head_dim     : dimension of a single head

Multiplying the total element count by the bytes per element (for example, 2 bytes for FP16) gives the memory usage.

KV cache bytes = num_layers * 2 * batch * num_kv_heads * seq_len * head_dim * dtype_bytes

Memory Arithmetic Example

Numbers land better than abstract formulas. Assume a model is as follows (approximate values for illustration).

Assumptions:
  num_layers   = 32
  num_kv_heads = 8        (GQA applied)
  head_dim     = 128
  dtype_bytes  = 2        (FP16)

KV bytes per token, per request:
  = 32 * 2 * 8 * 128 * 2
  = 131072 bytes (about 128 KB)

One request with a sequence of 4096 tokens:
  = 128 KB * 4096
  = about 512 MB

Receiving 32 such requests at once:
  = 512 MB * 32
  = about 16 GB (separate from the weights!)

In other words, the KV cache alone easily devours tens of GB. Loading a model with 16 GB of weights onto an 80 GB GPU is no reason to relax. How efficiently the KV cache uses the remaining space determines concurrency and context length. That is why reducing the number of KV heads with GQA/MQA carries great significance on the memory side. In the formula above, a smaller num_kv_heads directly shrinks the cache.

The Fragmentation Problem

Just as important as shrinking the KV cache is using that space without waste. Traditional inference systems reserved, for each request, a contiguous block of memory sized for the maximum context length. But the actual generation length cannot be known in advance.

Reserved up to 2048 tokens but actually generated only 100 tokens:
[####... (100 slots used) ...______________ (1948 slots wasted)]

When this waste piles up across requests, you end up in the
paradoxical situation of "no room to accept more" even though
GPU memory is plentiful.

This is called internal fragmentation. On top of that, as requests come and go, holes form in memory, causing external fragmentation as well. You need to reserve a large contiguous block, but only scattered free space remains, so the allocation fails. In the traditional approach, a substantial portion of GPU memory was simply thrown away like this.

PagedAttention: OS Virtual-Memory-Style Management

The idea behind PagedAttention came from the operating system's virtual memory. The OS shows each process a contiguous virtual address space, but the actual physical memory is scattered in page-sized units. A page table maps the two together.

PagedAttention does exactly the same for the KV cache. It divides the KV cache into small, fixed-size blocks and allocates one block at a time as tokens accumulate. A block table maps each request's logical token positions to physical blocks.

Assume block size = 16 tokens.

Physical block pool:
[blk0][blk1][blk2][blk3][blk4][blk5][blk6]...

Request A (40 tokens):  block table -> [blk0, blk3, blk5]
                        (16+16+8 tokens, only the last block partly used)
Request B (20 tokens):  block table -> [blk1, blk2]

Even though physically scattered, it behaves as if contiguous logically.

The effect is clear. Because blocks are allocated only as needed rather than reserving the maximum up front, internal fragmentation nearly disappears. Because blocks are small and uniform, external fragmentation disappears too. As a result, the same GPU memory can accept far more concurrent requests. The previously wasted space is converted into actual throughput.

Prefix Sharing and Copy-on-Write

Another major benefit of block-level management is sharing. When several requests share the same beginning (prefix), you can keep just one physical copy of that part's KV blocks and have multiple requests point to it.

Three requests with a common system prompt:

req1 block table: [shared blkS] -> [blk10] -> ...
req2 block table: [shared blkS] -> [blk11] -> ...
req3 block table: [shared blkS] -> [blk12] -> ...
                    ^^^^^^^^^^^
              all three share the same physical block

The more overlap there is, such as a system prompt, few-shot examples, or the common history of a multi-turn conversation, the larger the gain from this sharing. A shared prefix also needs its KV computed only once.

There is one thing to watch out for here: the case where one request needs to modify a shared block. For example, in a branching sequence, if only one branch changes a token, that block must be copied and then modified so that the other requests are not affected. This is copy-on-write. You share while reading, and the moment you write, you make a copy. It is exactly the same pattern as OS virtual memory.

KV Quantization: FP8 and INT8

If the KV cache is the memory bottleneck, another approach is to lower the precision of the cache itself to reduce its size. Storing KV in FP8 or INT8 instead of FP16 roughly halves the cache memory.

FP16 KV:  2 bytes per element  (baseline)
FP8  KV:  1 byte per element   (about 50% savings)
INT8 KV:  1 byte per element   (about 50% savings, needs scale/zero-point)

Because decode is memory-bound, reducing the amount of KV read also helps speed. However, lowering precision can subtly degrade output quality, so you must verify the impact depending on the model and task. In general, KV quantization tends to have a smaller quality impact than weight quantization, but you cannot conclude that it is unconditionally safe. For important workloads, it is safer to confirm with actual evaluation metrics.

Context Length and the Memory Trade-off

The KV cache is linearly proportional to seq_len. Doubling the context doubles the KV cache per request. In other words, long context and high concurrency compete for the same memory.

Within the same GPU memory budget:
  context 4K  -> many concurrent requests possible
  context 32K -> 8x KV per request -> concurrent requests drop sharply
  context 128K -> KV per request explodes -> concurrency severely limited

When designing serving, "what maximum context will we support" is directly tied to "how many users we can serve at once." Blindly increasing context collapses concurrency. It is important to be aware of this trade-off and set an upper limit that fits the workload.

GQA and MQA: Reduce KV by Reducing the Number of Heads

Look again at the fact that num_kv_heads is a multiplicative factor in the KV cache size formula. If you reduce the number of KV heads, the cache shrinks proportionally. This is the core motivation behind GQA (Grouped-Query Attention) and MQA (Multi-Query Attention).

MHA (Multi-Head Attention, traditional):
  32 query heads, 32 key/value heads as well
  -> KV cache = 32 heads' worth

MQA (Multi-Query Attention):
  32 query heads, only 1 key/value head (shared by all queries)
  -> KV cache = 1 head's worth (32x savings!)

GQA (Grouped-Query Attention, compromise):
  group the 32 query heads, with 1 KV head per group
  e.g.) 8 KV heads -> KV cache = 8 heads' worth (4x savings)

MQA reduces KV most aggressively, but its expressiveness is somewhat lower, which can affect quality. GQA is the compromise in between: it keeps a moderate number of KV heads to capture both memory savings and quality. As of 2026, many open models adopt GQA by default, precisely because serving memory efficiency has become that important.

KV cache savings (comparing only the head dimension):
  MHA (32 KV heads) : baseline 100%
  GQA (8 KV heads)  : about 25%
  MQA (1 KV head)   : about 3%

When operating a serving engine, knowing whether a model uses GQA/MQA has a direct impact on concurrency planning. Even for models with the same parameter count, the number of concurrent requests you can accept varies greatly depending on the KV head configuration.

Block Size as a Knob

In PagedAttention, the block size (the number of tokens held in one block) is an important tuning knob. Set it too large and the waste in the last block (internal fragmentation) grows; set it too small and block table management overhead and metadata increase.

Large block size (e.g., 64 tokens):
  - short block table, simple to manage
  - but large wasted space in the last block
    (need just 3 more tokens but allocate a whole 64-slot block)

Small block size (e.g., 8 tokens):
  - small waste in the last block
  - but many blocks mean more table/metadata

In practice: a value around 16 is commonly used as the balance point

In most cases sticking with the default is fine, but a workload with many very short requests and one with many very long requests may have different optimal block sizes. This is an example of "why block-level management provides a knob."

The Lifetime and Invalidation of the Prefix Cache

Prefix sharing is powerful, but how long to keep a shared block alive is another problem. The cache is not infinite, so at some point you must clear out old prefix blocks to make room for new requests.

Two axes of prefix cache management:
  1) hit: the beginning of a new request matches a cached prefix
     -> skip KV computation and reuse blocks (a gain)
  2) eviction: when space runs short, remove old prefixes
     -> usually LRU (least recently used first)

Trade-off:
  keep the prefix cache long -> higher hit rate but lower available memory
  evict quickly -> higher available memory but lower hit rate

If the workload keeps using the same system prompt, that prefix almost always hits, so it is well worth keeping. Conversely, if a completely different prefix arrives every time, you only pay the cost of maintaining the cache with no hits. Therefore the prefix cache works best when you know the nature of the workload.

How to Divide the Memory Budget

Putting everything so far together, GPU memory ultimately divides into a few chunks. Understanding this allocation makes the serving configuration clear at a glance.

GPU memory budget breakdown (80 GB GPU example):
  [model weights]            e.g., 16 GB (FP16 8B) ~ 40 GB (FP16 20B class)
  [activation/work buffers]  e.g., a few GB (proportional to batch/sequence)
  [KV cache pool]            all the rest -> determines concurrency and context
  [safety margin]            headroom to prevent OOM

Key point: the more you shrink weights via quantization, the larger
           the KV cache pool, and the more concurrent requests or
           longer context become possible.

The reason this picture matters is that nearly every serving-configuration decision boils down to this budget allocation. Raising gpu-memory-utilization means enlarging the KV cache pool, and turning on quantization means shrinking the weights to give the KV pool more space. Every knob ultimately moves on top of this one picture.

Pitfalls and Troubleshooting

  • OOM fires intermittently: this is the case where long requests arriving simultaneously, rather than the average request, made the KV cache explode. Size your memory budget assuming the worst-case concurrency.
  • Memory is plentiful but concurrency does not rise: fragmentation or conservative pre-allocation may be the cause. Check whether you are using a paged-style engine.
  • The prefix cache does not kick in as expected: requests may not actually share the same prefix, or the prefix may be shorter than the block size so the sharing unit does not line up.
  • Quality drops after KV quantization: the impact varies by task. Compare before and after quantization on the same evaluation set.
  • It is GQA but memory did not drop: check whether the number of KV heads actually decreased in the configuration and whether the cache is sized by KV heads.

Conclusion

The KV cache is the heart of LLM inference memory and the single most important variable governing concurrency and context length. Its size is determined by the product of layers, heads, sequence length, and precision, and the traditional management approach wasted a great deal of space through fragmentation.

PagedAttention borrowed the idea of OS virtual memory to nearly eliminate this waste, and with prefix sharing and copy-on-write it cut down redundancy as well. KV quantization goes one step further and compresses the cache itself. The common goal of all these techniques is one and the same: to provide more users with longer context on the same GPU. Once you understand the KV cache, the behavior and limits of a serving system finally come into clear focus.

References