KV Cache Calculator
This calculator estimates how much GPU memory you need to serve a large language model. It accounts for model weights, the KV cache (the part most people under-budget), and activation/runtime overhead, then compares the total against the memory available on common accelerators.
Pair it with the KV Cache blog post. Paste a Hugging Face model id (e.g. Qwen/Qwen2.5-7B-Instruct) to auto-fill the architecture, or pick a curated preset, then tweak any field in Advanced options to override.
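As a rough illustration of what auto-filling the architecture might involve, the sketch below pulls the fields that matter for KV cache sizing out of a `config.json`-style dictionary. This page does not show the calculator's own lookup logic, so treat it as an assumption. The `Arch` class and `arch_from_config` helper are hypothetical names, and the example values are illustrative numbers for a 7B-class model with grouped-query attention; check them against the real `config.json` of the model you care about.

```python
from dataclasses import dataclass


@dataclass
class Arch:
    """Architecture fields needed to size the KV cache (hypothetical container)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def arch_from_config(cfg: dict) -> Arch:
    """Read KV-cache-relevant fields from a Hugging Face style config dict."""
    n_heads = cfg["num_attention_heads"]
    # Configs without grouped-query attention often omit num_key_value_heads;
    # in that case every attention head has its own K/V head.
    n_kv_heads = cfg.get("num_key_value_heads") or n_heads
    # Some configs state head_dim explicitly; otherwise derive it.
    head_dim = cfg.get("head_dim") or cfg["hidden_size"] // n_heads
    return Arch(cfg["num_hidden_layers"], n_kv_heads, head_dim)


# Illustrative values only (assumed, not read from a real model repo).
example_cfg = {
    "num_hidden_layers": 28,
    "num_attention_heads": 28,
    "num_key_value_heads": 4,
    "hidden_size": 3584,
}

print(arch_from_config(example_cfg))
```

Any field returned here corresponds to something you could override by hand in Advanced options if the config is missing or unusual.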
| Quantity | Value |
|---|---|
| Model weights | |
| KV cache (per token) | |
| KV cache (total) | |
| Activations + runtime overhead | |
| Total VRAM required | |
| Available VRAM | |
| Headroom | |
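The rows above can be approximated by hand. The sketch below uses the standard per-token KV cache size for a decoder-only transformer (one K and one V vector per layer per KV head) and a flat overhead fraction. The overhead fraction, the 80 GiB accelerator, the parameter count, and the function names are all assumptions made for illustration; the calculator's actual overhead model is not described on this page.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache stored per token: K and V for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def estimate_vram(params_billion, num_layers, num_kv_heads, head_dim,
                  context_len, batch_size=1, weight_bytes=2, kv_bytes=2,
                  overhead_frac=0.10, available_gib=80.0):
    """Return each table row in bytes. overhead_frac is an assumed rule of thumb."""
    weights = params_billion * 1e9 * weight_bytes
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, kv_bytes)
    kv_total = per_token * context_len * batch_size
    overhead = overhead_frac * (weights + kv_total)  # assumption, not measured
    total = weights + kv_total + overhead
    available = available_gib * GIB
    return {
        "weights_bytes": weights,
        "kv_per_token_bytes": per_token,
        "kv_total_bytes": kv_total,
        "overhead_bytes": overhead,
        "total_bytes": total,
        "available_bytes": available,
        "headroom_bytes": available - total,
    }


# Illustrative inputs: 28 layers, 4 KV heads, head_dim 128, 16-bit weights and cache.
rows = estimate_vram(params_billion=7.6, num_layers=28, num_kv_heads=4,
                     head_dim=128, context_len=32768, batch_size=1)

# Per token: 2 * 28 * 4 * 128 * 2 = 57344 bytes (56 KiB).
print(rows["kv_per_token_bytes"])
for name, value in rows.items():
    print(f"{name:22s} {value / GIB:8.2f} GiB")
```

Total KV cache grows linearly with both context length and batch size, which is why it is easy to under-budget: doubling either one doubles the `kv_total_bytes` row while the weights stay fixed.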