KV Cache Calculator

This calculator estimates how much GPU memory you need to serve a large language model. It accounts for model weights, the KV cache (the part most people under-budget), and activation/runtime overhead, then compares the total against the memory available on common accelerators.

Pair it with the KV Cache blog post. Paste a Hugging Face model id (e.g. Qwen/Qwen2.5-7B-Instruct) to auto-fill the architecture, or pick a curated preset, then tweak any field in Advanced options to override it.

Prompt + generated tokens still resident in the cache.
× GPUs
Advanced options
CUDA workspace, activations, allocator slack.
CUDA driver, NCCL/communication buffers.
Tokens shared across all sequences (counted once instead of per-sequence).
96% — higher = less fragmentation waste.
Architecture (override / custom model)
Edits here become "manual overrides" that survive switching presets. A "↺ reset" link appears next to any field you've changed.
Of the total layers, how many keep a sequence-growing KV cache.
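The inputs above can be tied together in a short sketch. It is not taken from the calculator's source; it uses the standard per-token KV size (one key and one value vector per KV head, per caching layer) and assumes that shared tokens are simply stored once. The function and parameter names are hypothetical, and the example numbers are illustrative, so check them against the model's own config.

```python
def kv_bytes_per_token(kv_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache added by one token.

    kv_layers counts only the layers that keep a sequence-growing cache,
    which can be fewer than the model's total layer count.
    """
    # Factor 2: one key vector plus one value vector per KV head per layer.
    return 2 * kv_layers * num_kv_heads * head_dim * bytes_per_elem


def resident_tokens(num_sequences, tokens_per_sequence, shared_prefix_tokens=0):
    """Tokens held in the cache when a common prefix is stored only once."""
    if not 0 <= shared_prefix_tokens <= tokens_per_sequence:
        raise ValueError("shared prefix must fit inside the sequence length")
    unique_per_sequence = tokens_per_sequence - shared_prefix_tokens
    return shared_prefix_tokens + num_sequences * unique_per_sequence


# Illustrative values (assumed): 28 caching layers, 4 KV heads,
# head dimension 128, 2-byte (FP16/BF16) cache entries.
per_token = kv_bytes_per_token(kv_layers=28, num_kv_heads=4, head_dim=128)
tokens = resident_tokens(num_sequences=8, tokens_per_sequence=4096,
                         shared_prefix_tokens=1024)
kv_total_bytes = per_token * tokens
print(per_token, tokens, kv_total_bytes)
```

With these assumed values, each token costs 57,344 bytes (56 KiB), and sharing a 1,024-token prefix across 8 sequences leaves 25,600 resident tokens instead of 32,768.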
Memory used
Model weights
KV cache (per token)
KV cache (total)
Activations + runtime overhead
Total VRAM required
Available VRAM
Headroom
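As a rough illustration of how these output rows relate, the sketch below assumes that the total is a plain sum of the three components and that available VRAM is per-GPU memory times the GPU count. The calculator's actual formulas may differ, for example in how it applies the cache efficiency setting, and all names and numbers here are hypothetical.

```python
GIB = 1024 ** 3


def vram_budget(weight_bytes, kv_total_bytes, overhead_bytes,
                gpu_vram_bytes, num_gpus=1):
    """Return required, available, and headroom VRAM in bytes.

    Assumes a simple additive model; a negative headroom means the
    configuration does not fit.
    """
    required = weight_bytes + kv_total_bytes + overhead_bytes
    available = gpu_vram_bytes * num_gpus
    return {
        "required": required,
        "available": available,
        "headroom": available - required,
    }


# Illustrative inputs (assumed): 14 GiB of weights, 2 GiB of KV cache,
# 3 GiB of activations and runtime overhead, one 80 GiB GPU.
budget = vram_budget(14 * GIB, 2 * GIB, 3 * GIB, 80 * GIB, num_gpus=1)
for name, value in budget.items():
    print(f"{name}: {value / GIB:.1f} GiB")
```

Reporting headroom as a signed number keeps the "does it fit" check and the "how much is left" figure in one value.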