Understanding KV Cache: The Hidden Memory Cost of Serving LLMs

How attention architectures evolved to keep KV cache from eating your GPU, and what that means if you self-host.
Already comfortable with KV cache and attention? Skip the theory and jump straight to the interactive KV Cache Calculator to size VRAM for your model, batch size, and target GPU.
If you’re planning to self-host a large language model, you’ve probably sized VRAM based on parameters alone. A 70B model in BF16 needs roughly 140 GB just for weights. That’s the easy part: 70 billion parameters × 2 bytes.
What’s less obvious is the second memory consumer that grows while the model is actually serving requests: the Key-Value (KV) cache.
KV cache scales with every cached token in an active request: prompt tokens, generated tokens, and any prefix-cache entries the engine keeps resident. It also scales with the number of concurrent sequences. At 32K–128K context, KV cache can easily become the largest single thing on the GPU. If you don’t budget for it, you’ll serve one long-context user when you wanted to serve many.
This post walks from the basics of attention through the architectural and runtime tricks people use to shrink KV cache. By the end you should be able to look at a model card and roughly predict its memory profile.
Part 1: Attention, a quick refresher
The original transformer attention
The 2017 paper “Attention Is All You Need” introduced Multi-Head Attention (MHA), the mechanism that lets a model look back at previous tokens when generating the next one.
Three steps:
- Project. For each token, build three vectors (a Query (Q), a Key (K), and a Value (V)) by multiplying the token representation by learned weight matrices.
- Score. Take the dot product of Q with K. This answers “how relevant is each previous token to what I’m generating now?”
- Aggregate. Softmax the scores, then take a weighted sum of V. The result is a context-aware representation of the current token.
The formula:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Why we need a cache
During training, the model usually processes the whole training sequence in parallel. There’s no decode-time KV cache because every token is available at once. Training is still memory-hungry (activations, gradients, optimizer state) but that’s a different problem.
Inference is the autoregressive case, where the model generates tokens one at a time. Each new token needs to attend to every token that came before it. Without a cache, generating token #1000 would mean recomputing K and V for all 999 previous tokens, and every later step would repeat that work over an ever-longer prefix. The KV cache is the obvious fix: store the K and V projections once and reuse them on every subsequent step. Hugging Face’s documentation explains it the same way (Hugging Face cache explanation).
Key insight: We cache K and V, not Q. Q is the “question” being asked by the current token and only matters for that one step. K and V are the “memory” the question gets asked against, and that memory has to stick around.
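To make the caching idea concrete, here is a minimal single-head sketch in plain Python. The projection matrices, the toy dimension, and the random inputs are all made up for illustration, and a real model would feed each generated token back in as the next input. The sketch only checks one thing: decoding with cached K/V gives the same output as re-projecting K and V for the whole prefix at every step.

```python
import math
import random

random.seed(0)
D = 4  # toy head dimension (illustrative)

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical learned projection matrices for one attention head.
W_Q, W_K, W_V = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

def project(x, w):
    """Multiply row vector x (length D) by matrix w (D x D)."""
    return [sum(x[i] * w[i][j] for i in range(D)) for j in range(D)]

def attend(q, keys, values):
    """softmax(q . K^T / sqrt(d_k)) V for a single query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values)) for j in range(D)]

def decode_no_cache(tokens):
    """Re-project K and V for the whole prefix at every step."""
    outputs = []
    for t in range(len(tokens)):
        keys = [project(x, W_K) for x in tokens[: t + 1]]
        values = [project(x, W_V) for x in tokens[: t + 1]]
        outputs.append(attend(project(tokens[t], W_Q), keys, values))
    return outputs

def decode_with_cache(tokens):
    """Project K and V once per token, append to the cache, and reuse."""
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        k_cache.append(project(x, W_K))
        v_cache.append(project(x, W_V))
        # Q is computed fresh and never stored.
        outputs.append(attend(project(x, W_Q), k_cache, v_cache))
    return outputs

tokens = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(6)]
uncached = decode_no_cache(tokens)
cached = decode_with_cache(tokens)
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(uncached, cached)
    for a, b in zip(row_a, row_b)
)
print(same)
```

The uncached version does a number of K/V projections that grows with the square of the sequence length; the cached version does one K and one V projection per token and pays for it in memory instead.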
Part 2: How big is the KV cache?
The formula
The serving-time number you actually want is:
Total KV cache bytes =
2 × num_layers × num_key_value_heads × head_dim
× cached_tokens × active_sequences × bytes_per_element
Where:
- 2 because we store both K and V.
- num_layers because every attention layer has its own KV cache.
- num_key_value_heads is the number of KV heads, which is not always the same as the number of query heads.
- head_dim is the per-head vector size, often 64, 128, or 256.
- cached_tokens is prompt tokens plus generated tokens still resident in the cache.
- active_sequences is your active batch / concurrent sequences.
- bytes_per_element is 2 for BF16/FP16, 1 for FP8/INT8, 0.5 for INT4-style packed storage.
For standard Multi-Head Attention (MHA):
num_key_value_heads = num_attention_heads
For Grouped-Query Attention (GQA) or Multi-Query Attention (MQA):
num_key_value_heads < num_attention_heads
That distinction is the whole game. GQA and MQA shrink KV cache by reducing how many K/V heads are stored, while keeping more Q heads for model capacity.
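The formula translates directly into a small helper. This is a sketch of the arithmetic only; the function name and the example configuration (32 layers, 8 KV heads, head_dim 128) are hypothetical and not taken from any model card.

```python
def kv_cache_bytes(num_layers, num_key_value_heads, head_dim,
                   cached_tokens, active_sequences, bytes_per_element=2):
    """Total KV cache size in bytes: K and V, per layer, per KV head, per token."""
    return int(2 * num_layers * num_key_value_heads * head_dim
               * cached_tokens * active_sequences * bytes_per_element)

# Hypothetical GQA config: 4 sequences with 4,096 cached tokens each.
for label, width in [("BF16", 2), ("FP8", 1), ("INT4-style", 0.5)]:
    total = kv_cache_bytes(32, 8, 128, cached_tokens=4_096,
                           active_sequences=4, bytes_per_element=width)
    print(f"{label}: {total / 2**30:.2f} GiB")
```

Every term is a plain multiplier, so halving any one of them (KV heads, cached tokens, element width) halves the total.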
A concrete example: a 70B-scale MHA baseline
The example below is intentionally a worst-case MHA baseline. It is not a claim that every 70B-class model uses this exact configuration; many of them use GQA, MLA, sliding windows, or hybrid attention.
| Parameter | Value |
|---|---|
| Layers | 80 |
| Query heads | 64 |
| KV heads | 64 |
| Head dimension | 128 |
| Precision | BF16 (2 bytes) |
Per token:
2 × 80 × 64 × 128 × 2 = 2,621,440 bytes ≈ 2.5 MiB
Now scale it up:
| Scenario | Cached Tokens | Active Sequences | KV Cache Size |
|---|---|---|---|
| Single user, short chat | 2,048 | 1 | 5 GiB |
| Single user, long context | 32,768 | 1 | 80 GiB |
| 8 users, moderate context | 8,192 | 8 | 160 GiB |
| 16 users, long context | 32,768 | 16 | 1.25 TiB |
Reality check: A 70B model’s weights alone need around 140 GB in BF16. In the worst-case MHA baseline above, 16 concurrent users at 32K cached tokens add about 1.25 TiB of KV cache, before allocator overhead, activation workspace, fragmentation, and tensor-parallel/runtime overhead.
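The scenario table can be reproduced with a few lines of Python. This is just the arithmetic for the worst-case MHA baseline above; the helper name is my own.

```python
GIB = 2**30
# K+V x 80 layers x 64 KV heads x head_dim 128 x 2 bytes (BF16)
PER_TOKEN_BYTES = 2 * 80 * 64 * 128 * 2

scenarios = [
    ("Single user, short chat", 2_048, 1),
    ("Single user, long context", 32_768, 1),
    ("8 users, moderate context", 8_192, 8),
    ("16 users, long context", 32_768, 16),
]

def scenario_gib(cached_tokens, active_sequences):
    """KV cache size in GiB for the MHA baseline."""
    return PER_TOKEN_BYTES * cached_tokens * active_sequences / GIB

for name, tokens, seqs in scenarios:
    print(f"{name}: {scenario_gib(tokens, seqs):,.2f} GiB")
```

The last row comes out at 1,280 GiB, which is the 1.25 TiB quoted above.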
Part 3: The scaling problem, visualized
The chart below uses the same 70B-scale MHA baseline and shows how fast cache grows with sequence length and active batch size.
Linear vs. quadratic: what scales how?
You may have heard that attention scales “quadratically” with sequence length. That is true for full attention computation and for the naïve attention-score matrix. It is not true for the persistent KV cache itself.
| Component | Scaling | Explanation |
|---|---|---|
| KV cache storage | O(n) linear | One K and one V vector per cached token per attention layer. Double the cached tokens, double the cache. |
| Full attention computation during prefill/training | O(n²) quadratic | Each token attends over many other tokens in the full sequence. |
| Naïve attention-score matrix | O(n²) quadratic | The seq × seq score matrix is quadratic. Kernels like FlashAttention avoid materializing the full matrix and bring extra attention workspace down to roughly linear, while also cutting HBM traffic. |
| Decode-time attention per generated token | O(n) per token | A new token attends over the cached context. The longer the cache, the more K/V state you have to read. |
This post focuses on the persistent memory footprint: the KV cache that lives on the GPU for the lifetime of the request. That cost is linear in cached tokens. FlashAttention-style kernels reduce attention workspace and memory traffic, but they don’t remove the KV cache, and they don’t make a longer cache cheaper to read.
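A quick way to see the difference is to double the sequence length and watch what happens to each quantity. The sketch below counts elements rather than bytes and uses the same illustrative 80-layer, 64-head baseline; the function names are my own.

```python
def kv_cache_elements(n, num_layers=80, num_kv_heads=64, head_dim=128):
    """Persistent K and V elements for one sequence of n cached tokens: O(n)."""
    return 2 * num_layers * num_kv_heads * head_dim * n

def naive_score_elements(n, num_heads=64):
    """Entries in a fully materialized n x n score matrix for one layer: O(n^2)."""
    return num_heads * n * n

for n in (4_096, 8_192, 16_384):
    print(n, kv_cache_elements(n), naive_score_elements(n))

# Doubling n doubles the cache but quadruples the naive score matrix.
kv_growth = kv_cache_elements(8_192) / kv_cache_elements(4_096)
score_growth = naive_score_elements(8_192) / naive_score_elements(4_096)
print(kv_growth, score_growth)
```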
The tension is direct:
- Longer contexts → more information available to the model.
- Larger batches → better hardware utilization and throughput.
- More KV memory → fewer users per GPU and higher cost per query.
That tension is most of why model builders and inference-engine teams have spent the last few years inventing ways to either shrink the KV cache or reduce the cost of moving it around.
Part 4: Five attention mechanisms that actually move the needle
The five approaches below are the ones I keep running into when reading modern model configs. They cover head sharing, low-rank compression, locality, and hybrid recurrent/attention layouts.
1. Multi-Query Attention (MQA)
Paper: Fast Transformer Decoding: One Write-Head is All You Need (Shazeer, 2019)
Seen in: PaLM, Falcon, StarCoder/StarCoderBase, Gemma 2B, and some smaller variants. StarCoder2 moved from MQA to GQA.
KV cache reduction: 98.4% versus a 64-KV-head MHA baseline.
Instead of each attention head having its own K and V projection, every query head shares a single K head and a single V head. Q stays per-head.
Standard MHA: 64 KV head pairs = 64 K heads + 64 V heads
MQA: 1 KV head pair = 1 shared K head + 1 shared V head
For a 64-head MHA baseline, MQA shrinks the KV cache by a factor of 64:
1 - (1 / 64) = 98.4375% smaller
K and V across heads are redundant enough that one shared pair is often good enough for decoding. Shazeer’s paper is mostly about memory bandwidth: repeatedly loading the K/V tensors during incremental decoding is the bottleneck, and one shared pair makes that much cheaper.
The trade-off is real, though. MQA tends to lose a little quality compared with MHA, especially on tasks that benefit from multiple independent K/V views, and it changes training dynamics. That’s the reason GQA exists.
| Model Config | KV Cache per Token (BF16) |
|---|---|
| MHA (64 KV heads) | 2 × 80 × 64 × 128 × 2 = 2.5 MiB |
| MQA (1 KV head) | 2 × 80 × 1 × 128 × 2 = 40 KiB |
2. Grouped-Query Attention (GQA)
Paper: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., 2023)
Seen in: Many modern Llama, Mistral, Qwen, Gemma, and StarCoder2 variants; exact configs vary by model.
KV cache reduction: 87.5% for an 8-KV-head config against a 64-KV-head MHA baseline.
GQA is the middle ground. Instead of 64 independent KV heads or 1 shared KV head, query heads are split into groups, and each group shares one K head and one V head.
MHA: 64 KV head pairs
GQA: 8 KV head pairs (example: 8 groups)
MQA: 1 KV head pair
The GQA paper describes it as an interpolation between MHA and MQA, and reports quality close to MHA with inference efficiency closer to MQA. In practice, that’s a good description of what many modern open-weight models picked.
The trade-off is usually small compared with MHA, but it isn’t literally free. The exact balance depends on the model, training recipe, number of KV heads, and workload.
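The grouping itself is simple index arithmetic. The sketch below assumes contiguous groups (query heads 0–7 share KV head 0, and so on), which is a common convention but still an assumption here; the function name is hypothetical.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Index of the shared K/V head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

# 64 query heads mapped onto 64 (MHA), 8 (GQA-8), and 1 (MQA) KV heads.
mha = [kv_head_for_query_head(q, 64, 64) for q in range(64)]
gqa = [kv_head_for_query_head(q, 64, 8) for q in range(64)]
mqa = [kv_head_for_query_head(q, 64, 1) for q in range(64)]

print(gqa[:10])                                     # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(len(set(mha)), len(set(gqa)), len(set(mqa)))  # 64 8 1
```

Only the distinct KV heads are stored in the cache, which is why the count of unique indices is the number that matters for memory.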
| Model Config | KV Cache per Token (BF16) |
|---|---|
| GQA-8 (8 KV heads) | 2 × 80 × 8 × 128 × 2 = 320 KiB |
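Putting the three head-sharing variants side by side, using the same illustrative 80-layer, head_dim-128, BF16 baseline (the helper name is my own):

```python
def per_token_kv_bytes(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """KV cache bytes added by one cached token in one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

baseline = per_token_kv_bytes(80, 64, 128)
for name, kv_heads in [("MHA", 64), ("GQA-8", 8), ("MQA", 1)]:
    size = per_token_kv_bytes(80, kv_heads, 128)
    reduction = 1 - size / baseline
    print(f"{name}: {size // 1024} KiB per token, {reduction:.2%} smaller than MHA")
```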
Practical note: When sizing a real model, do not use num_attention_heads blindly. Look for the config field that’s usually called num_key_value_heads.
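Here is a rough sketch of reading those fields from a Hugging Face style config that has already been loaded into a dict. The field names below are the ones Llama-style configs commonly use, but names vary between model families, so treat them as assumptions and check the actual config. The example values are hypothetical.

```python
def kv_shape_from_config(config):
    """Pull the KV-relevant fields out of a Hugging Face style config dict."""
    num_layers = config["num_hidden_layers"]
    num_q_heads = config["num_attention_heads"]
    # Fall back to the query-head count only when no KV-head field exists (plain MHA).
    num_kv_heads = config.get("num_key_value_heads", num_q_heads)
    # Some configs state head_dim explicitly; otherwise derive it from hidden_size.
    head_dim = config.get("head_dim") or config["hidden_size"] // num_q_heads
    return num_layers, num_kv_heads, head_dim

# Hypothetical config values, not taken from any specific model card.
example = {
    "num_hidden_layers": 80,
    "hidden_size": 8192,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
}
layers, kv_heads, head_dim = kv_shape_from_config(example)
per_token = 2 * layers * kv_heads * head_dim * 2  # BF16
print(layers, kv_heads, head_dim, per_token)      # 80 8 128 327680
```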
3. Multi-Head Latent Attention (MLA)
Paper: DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Seen in: DeepSeek-V2, DeepSeek-V3, DeepSeek-R1-family models.
KV cache reduction: Roughly an order of magnitude in DeepSeek-style configurations; DeepSeek-V2 reports a 93.3% reduction relative to DeepSeek 67B, the earlier dense model it compares against.
Instead of caching full K and V tensors, MLA caches a compressed latent representation. DeepSeek-V2 uses low-rank joint compression for K/V, and handles positional/RoPE information separately so caching still works cleanly.
Standard MHA/GQA-style cache:
Cache K and V states per attention layer per cached token
DeepSeek-style MLA cache:
Cache compressed latent KV state,
plus small positional/RoPE-related state depending on the implementation
A simpler way to picture it:
Standard: cache full K + full V
MLA: cache a compressed latent representation from which attention can be computed
The reason it works is that K and V have a lot of redundancy. A learned low-rank representation can preserve the parts that matter for attention while throwing away most of the memory footprint. DeepSeek-V3 sticks with the same architecture and pairs MLA with DeepSeekMoE for efficient inference (DeepSeek-V3 technical report).
One nuance worth flagging: MLA isn’t just “compress K and V to 512 dimensions and decompress later.” In optimized implementations, some of the projection matrices can be absorbed or rearranged so the full K/V tensors don’t always need to be reconstructed at all. The exact cache size depends on compressed dimension, head dim, positional-key dimension, number of layers, and what the kernel actually does.
The trade-off: MLA saves a lot of memory, but it’s architecturally more invasive than GQA, and it adds projection/kernel complexity. The 93.3% number is DeepSeek-V2-specific, not a universal constant.
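As a rough sizing sketch, an MLA-style cache stores one compressed latent vector plus a small positional key per layer per token, instead of full K and V for every head. Everything numeric below is an illustrative assumption: the 512-dim latent echoes the figure mentioned above, while the 64-dim positional key, the 60 layers, and the 128-head same-shape MHA comparison are placeholders I picked, not values confirmed by this post.

```python
def mla_per_token_bytes(num_layers, compressed_kv_dim, rope_key_dim, bytes_per_element=2):
    """Latent KV vector plus a small decoupled positional key, per layer per token."""
    return num_layers * (compressed_kv_dim + rope_key_dim) * bytes_per_element

def mha_per_token_bytes(num_layers, num_heads, head_dim, bytes_per_element=2):
    """Full K and V for every head, per layer per token."""
    return 2 * num_layers * num_heads * head_dim * bytes_per_element

# Assumed shapes for illustration only.
mla = mla_per_token_bytes(num_layers=60, compressed_kv_dim=512, rope_key_dim=64)
mha = mha_per_token_bytes(num_layers=60, num_heads=128, head_dim=128)
print(mla, mha, f"{1 - mla / mha:.1%} smaller")
```

The ratio depends entirely on which baseline you compare against, so this will not reproduce the 93.3% figure, which comes from DeepSeek’s own comparison.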
DeepSeek-V4 note: DeepSeek-V4 should not be described as “plain MLA.” The Hugging Face Transformers documentation says DeepSeek-V4 replaces V3’s MLA with a hybrid local + long-range design, and the DeepSeek-V4-Pro model card describes hybrid compressed attention mechanisms for long context (DeepSeek-V4 docs, DeepSeek-V4-Pro model card).
4. Sliding Window Attention (SWA)
Papers / references: Longformer popularized local-window attention for long documents; Mistral 7B brought GQA + SWA into a compact decoder LLM.
Seen in: Mistral 7B and several later long-context models. Gemma 3 uses a local/global interleaving pattern rather than making every layer local.
KV cache reduction: Bounded by window size, but only for layers that are actually local-window layers.
Instead of every attention layer attending to every previous token, restrict some or all of them to a fixed local window, for example the last 4,096 tokens.
Standard full attention:
Token 50,000 can attend to all previous tokens
Sliding window attention:
Token 50,000 attends only to a recent window,
for example tokens 45,905–50,000 with a 4,096-token window
Most next-token predictions lean heavily on nearby context, and deeper layers carry information forward anyway, so local windows can still propagate signal through the network. Mistral’s 7B release notes describe each layer attending to the previous 4,096 hidden states and call out reduced inference cost for long sequences (Mistral announcement).
The nuance: sliding-window memory is only fully bounded if every relevant attention layer is a sliding-window layer and the serving implementation actually evicts old K/V states. Hybrid local/global models still have global layers whose KV cache grows with total context.
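The eviction part is what actually bounds memory. Below is a conceptual sketch of one local layer’s cache using a fixed-length deque; real serving engines typically use paged blocks or ring buffers on the GPU, so the class and its methods are hypothetical and only illustrate the bookkeeping.

```python
from collections import deque

WINDOW = 4_096

class SlidingWindowKVCache:
    """Keeps only the most recent `window` K/V entries for one local layer."""

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest entry automatically on append.
        self.entries = deque(maxlen=window)

    def append(self, position, key, value):
        self.entries.append((position, key, value))

    def positions(self):
        return [p for p, _, _ in self.entries]

cache = SlidingWindowKVCache(WINDOW)
for pos in range(1, 50_001):                 # token positions 1..50,000
    cache.append(pos, key=None, value=None)  # placeholders instead of real tensors

held = cache.positions()
print(len(held), held[0], held[-1])          # 4096 45905 50000
```

If the engine keeps old entries around instead of evicting them, the layer is windowed for compute but not for memory.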
Gemma 3 is a clean example. Google’s Gemma 3 material describes a 5-to-1 interleaving pattern: 5 local attention layers with a 1024-token sliding window followed by 1 global attention layer (Google Gemma 3 explainer, Gemma 3 technical report).
For hybrid local/global attention, a more accurate rough sizing factor is:
effective_cached_tokens_per_6_layers =
5 × min(sequence_length, local_window) + 1 × sequence_length
So even with a 5-local/1-global pattern, memory still grows with long context, because the global layer keeps the full sequence.
| Scenario | Standard MHA KV Cache | Pure SWA, 4K Window | 5:1 Local/Global, 1K Local Window |
|---|---|---|---|
| 32K seq, 1 user | 80 GiB | 10 GiB | ~15.4 GiB |
| 128K seq, 1 user | 320 GiB | 10 GiB | ~55.4 GiB |
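The table values follow from the sizing factor above applied to the 80-layer MHA baseline. A minimal sketch, assuming the layer count can be treated as a continuous multiple of the 6-layer block (80 is not divisible by 6, so this is an approximation); the function names are my own.

```python
GIB = 2**30
PER_TOKEN_PER_LAYER = 2 * 64 * 128 * 2  # K+V x 64 KV heads x head_dim 128 x BF16 bytes

def hybrid_kv_gib(seq_len, num_layers=80, local_window=1_024, local_per_global=5):
    """Rough KV size when each block has `local_per_global` local layers and 1 global layer."""
    block = local_per_global + 1
    tokens_per_block = local_per_global * min(seq_len, local_window) + seq_len
    # Approximation: spread the per-block token count evenly over all layers.
    effective_layer_tokens = num_layers * tokens_per_block / block
    return PER_TOKEN_PER_LAYER * effective_layer_tokens / GIB

def pure_swa_gib(seq_len, num_layers=80, window=4_096):
    """KV size when every layer is a sliding-window layer that evicts old entries."""
    return PER_TOKEN_PER_LAYER * num_layers * min(seq_len, window) / GIB

for seq_len in (32_768, 131_072):
    print(seq_len, round(pure_swa_gib(seq_len), 1), round(hybrid_kv_gib(seq_len), 1))
```

Note how the pure-SWA column stays flat while the hybrid column keeps growing: at long context the single global layer per block dominates.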
The trade-off: true long-range dependencies can weaken when too many layers are local-only. Hybrid local/global patterns are the usual fix, because they keep at least one path that sees the full sequence while still cutting most of the local-layer KV cache.
5. Hybrid linear-attention / state-space architectures
Examples: Jamba, Qwen3-Next, Zamba, Hymba, and other Mamba/DeltaNet/RWKV-style research and production models.
Seen in: Qwen3-Next-style Gated DeltaNet + attention layouts, Jamba-style Mamba + Transformer layouts.
KV cache reduction: Often 50–80% at the attention-layer level, depending on how many layers stay full attention.
Not every layer has to do full attention. Some attention layers can be replaced with state-space models or linear-attention layers, which keep a fixed-size recurrent or state tensor instead of a KV cache that grows with sequence length.
Pure Transformer (64 attention layers):
64 layers × sequence-growing KV cache
Hybrid (example: 16 attention + 48 linear/state layers):
16 layers × sequence-growing KV cache
+ 48 layers × fixed-size recurrent/state cache
Linear and state-space layers process sequences efficiently with state that doesn’t grow with length. A small number of attention layers can preserve the exact token-to-token lookup where it actually matters.
Qwen3-Next is a current example. The Qwen3-Next-80B-A3B-Instruct model card describes 48 layers arranged as 12 repeats of 3 Gated DeltaNet layers followed by 1 Gated Attention layer, a 3:1 linear-state-to-attention layout (Qwen model card, Qwen blog).
The non-attention layers are not free, though. They keep fixed-size recurrent/state/conv caches and they still consume compute and memory. The win is that the state is generally independent of total context length.
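A small sketch of that split, using the 16-attention / 48-state example layout from above. The per-layer numbers are hypothetical placeholders (8 KV heads, head_dim 128, and 4 MiB of fixed state per non-attention layer); real state sizes depend on the architecture.

```python
def hybrid_cache_bytes(seq_len, attention_layers, state_layers,
                       kv_bytes_per_token_per_layer, state_bytes_per_layer):
    """Sequence-growing KV on attention layers plus fixed state on the rest."""
    kv = attention_layers * kv_bytes_per_token_per_layer * seq_len
    state = state_layers * state_bytes_per_layer
    return kv, state

KV_PER_TOKEN_PER_LAYER = 2 * 8 * 128 * 2  # hypothetical: 8 KV heads, head_dim 128, BF16
STATE_PER_LAYER = 4 * 2**20               # hypothetical: 4 MiB of fixed state per layer

for seq_len in (8_192, 131_072):
    kv, state = hybrid_cache_bytes(seq_len, 16, 48, KV_PER_TOKEN_PER_LAYER, STATE_PER_LAYER)
    pure_kv, _ = hybrid_cache_bytes(seq_len, 64, 0, KV_PER_TOKEN_PER_LAYER, 0)
    print(f"{seq_len}: hybrid KV {kv / 2**20:.0f} MiB + state {state / 2**20:.0f} MiB, "
          f"pure transformer KV {pure_kv / 2**20:.0f} MiB")
```

Under these assumptions the state term stays at 192 MiB regardless of context, while the KV term on the 16 attention layers is a quarter of what 64 attention layers would need.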
The trade-off: more complex architecture and a more complex kernel stack. Some tasks that need exact retrieval from very long context probably want more attention layers. Hybrid designs are increasingly practical, but the right ratio is model- and workload-dependent.
Part 5: Putting it all together
The image below compares mechanisms using a reference 70B-scale model. It’s an illustrative baseline, not a universal measurement for every real model.
Here is a reference table for that 70B-scale, 80-layer, 64-head, head-dim-128 baseline at 32K cached tokens, batch size 1, BF16:
| Mechanism | Assumption | KV Cache at 32K | Reduction vs. MHA |
|---|---|---|---|
| MHA baseline | 64 KV heads | 80 GiB | — |
| GQA-8 | 8 KV heads | 10 GiB | 87.5% |
| MQA | 1 KV head | 1.25 GiB | 98.4% |
| DeepSeek-style MLA | 93.3% reduction from reported DeepSeek-V2-style comparison | ~5.4 GiB | ~93.3% |
| Pure SWA | All layers local, 4K window | 10 GiB | 87.5% |
| Gemma-style local/global | 5 local layers at 1K + 1 global layer | ~15.4 GiB | ~80.7% |
| Hybrid linear/state + attention | 25% of layers keep attention KV | 20 GiB | 75% |
The real-world landscape
The current model landscape is more varied than “everything uses GQA”:
- GQA is common because it gives a strong memory/quality trade-off in practice.
- MQA is very memory-efficient, but the quality and stability trade-off is why GQA is the more typical choice for larger modern transformer models.
- MLA is a DeepSeek-family differentiator for V2/V3/R1-style models, but newer DeepSeek-V4-style designs use a different hybrid compressed-attention approach.
- Sliding-window and local/global attention are common long-context patterns. Local layers can bound local KV memory, but global layers still grow with sequence length, so don’t assume the whole model is bounded.
- Hybrid linear/state-space architectures reduce the number of layers that need a sequence-growing KV cache, but their fixed recurrent/state caches are not literally zero.
Practical takeaway: When estimating VRAM for a model, do not use the MHA formula blindly. Check the model card or config for:
- num_key_value_heads versus num_attention_heads
- Number of attention layers versus linear/state-space layers
- Sliding-window size and whether global attention layers exist
- KV-cache precision used by your inference engine
What’s next
There’s now an interactive KV Cache Calculator that lets you plug in any model architecture and see how much VRAM you need for your target batch size and sequence length. It accounts for:
- MHA, MQA, GQA, MLA-style compression, sliding windows, and hybrid attention/state layers
- KV-cache precision: BF16/FP16, FP8/INT8, and optional INT4-style storage
- Number of active sequences and cached tokens per sequence
- Runtime overhead for allocation, fragmentation, CUDA graphs, communication buffers, and serving-engine metadata
- Prefix sharing and paged-cache utilization assumptions
The formula to keep in your head is:
Total KV Cache =
2 × layers_with_attention × kv_heads × head_dim
× cached_tokens × active_sequences × bytes_per_element
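Turned around, the same formula answers the capacity-planning question: given a KV budget, how many sequences fit? The sketch below is a back-of-the-envelope helper with a hypothetical 40 GiB budget; it ignores allocator overhead, fragmentation, and prefix sharing.

```python
def max_active_sequences(kv_budget_bytes, layers_with_attention, kv_heads, head_dim,
                         cached_tokens, bytes_per_element=2):
    """How many sequences of `cached_tokens` fit in a given KV cache budget."""
    per_sequence = (2 * layers_with_attention * kv_heads * head_dim
                    * cached_tokens * bytes_per_element)
    return int(kv_budget_bytes // per_sequence)

# Hypothetical budget: 40 GiB left for KV cache after weights and runtime overhead.
budget = 40 * 2**30
mha_fit = max_active_sequences(budget, 80, 64, 128, 32_768)   # 80 GiB per sequence
gqa_fit = max_active_sequences(budget, 80, 8, 128, 32_768)    # 10 GiB per sequence
gqa_fp8_fit = max_active_sequences(budget, 80, 8, 128, 32_768, bytes_per_element=1)
print(mha_fit, gqa_fit, gqa_fp8_fit)  # 0 4 8
```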
Everything else in this post is about making one of those terms smaller, or about making the serving engine waste less memory around it.
Got questions, or a specific model you want me to size? Reach out to me on LinkedIn. This kind of capacity planning is genuinely fun for me, so I’m happy to dig in.