KV Cache Calculator

This calculator estimates how much GPU memory you need to serve a large language model. It accounts for model weights, the KV cache (the part most people under-budget), and activation/runtime overhead, then compares the total against the memory available on common accelerators.

It’s intended to pair with the KV Cache blog post. All math runs in your browser; nothing is sent to a server. Paste a Hugging Face model id (e.g. Qwen/Qwen2.5-7B-Instruct) to auto-fill the architecture from the model’s config.json (only metadata is fetched, never weights), or pick a curated preset. You can then tweak any field in Advanced options to override it.

Model preset

Or load from Hugging Face

Only the model's config.json and metadata are fetched (a few KB) — never weights. Gated repos return 401; try a public mirror like unsloth/<model>.
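As a rough illustration of the auto-fill step, the sketch below fetches a public config.json and maps it to the architecture fields used on this page. The function names and the returned field names are hypothetical, and the key names (num_hidden_layers, num_attention_heads, num_key_value_heads, head_dim, hidden_size) are the ones commonly found in Llama- and Qwen-style configs; other architectures may use different keys, and the calculator's own mapping may differ.

```python
import json
from urllib.request import urlopen


def config_url(model_id: str, revision: str = "main") -> str:
    """Build the URL of a repo's config.json on the Hugging Face Hub."""
    return f"https://huggingface.co/{model_id}/resolve/{revision}/config.json"


def fetch_config(model_id: str) -> dict:
    """Download only config.json (a few KB), never weights.

    Gated repos raise urllib.error.HTTPError (401) without credentials.
    """
    with urlopen(config_url(model_id), timeout=10) as resp:
        return json.load(resp)


def architecture_from_config(cfg: dict) -> dict:
    """Map common config keys to calculator fields (hypothetical field names)."""
    q_heads = cfg["num_attention_heads"]
    return {
        "layers": cfg["num_hidden_layers"],
        "q_heads": q_heads,
        # Assumption: a missing num_key_value_heads means plain multi-head attention.
        "kv_heads": cfg.get("num_key_value_heads", q_heads),
        # Assumption: a missing head_dim is derived as hidden_size / q_heads.
        "head_dim": cfg.get("head_dim") or cfg["hidden_size"] // q_heads,
    }
```

Calling architecture_from_config(fetch_config("Qwen/Qwen2.5-7B-Instruct")) would perform one small HTTP request; the parsing step can also be run on a config dict you already have.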

Concurrent sequences (batch)

Sequence length (tokens): Prompt + generated tokens per sequence (i.e. context length). Sliding-window attention only caches a window of these.
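To make the sliding-window remark concrete, here is a minimal sketch of how many tokens a single layer would keep in its cache. The function name is hypothetical, and it assumes a sliding-window layer retains at most the most recent window tokens.

```python
from typing import Optional


def cached_tokens(seq_len: int, window: Optional[int] = None) -> int:
    """Tokens one attention layer keeps cached for a single sequence.

    window=None models full attention: every token stays cached.
    Otherwise the layer is assumed to keep only the last `window` tokens.
    """
    if window is None:
        return seq_len
    return min(seq_len, window)
```

For a 32,768-token sequence and a 4,096-token window, a sliding-window layer caches 4,096 tokens under this assumption, while a full-attention layer caches all 32,768.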

KV cache precision

Weight precision

Target GPU

× GPUs

Advanced options ↺ reset all overrides

Runtime overhead (% of weights): CUDA workspace, activations, allocator slack.

Fixed framework overhead (GiB): CUDA driver, NCCL/communication buffers.

Shared prefix tokens (prefix caching): Tokens shared across all sequences (counted once instead of per-sequence).
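One way to read "counted once instead of per-sequence" is sketched below. The function name is hypothetical, and it assumes the shared prefix is part of each sequence's length and is clamped to that length.

```python
def total_cached_tokens(batch: int, seq_len: int, shared_prefix: int = 0) -> int:
    """Total tokens held in the KV cache across a batch.

    The shared prefix is stored once; only the remaining tokens of each
    sequence are stored per sequence.
    """
    shared = min(shared_prefix, seq_len)  # a prefix cannot exceed the sequence
    return shared + batch * (seq_len - shared)
```

With 8 sequences of 4,096 tokens and a 1,024-token shared prefix, this gives 25,600 cached tokens instead of 32,768 without prefix caching.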

PagedAttention efficiency: 96% (higher = less fragmentation waste).
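The efficiency figure can be read as the fraction of allocated cache memory that holds useful KV data. The sketch below assumes the calculator divides the ideal KV cache size by this fraction; that is an assumption about the tool, and the function name is hypothetical.

```python
def allocated_kv_bytes(needed_bytes: float, efficiency: float = 0.96) -> float:
    """Memory actually reserved for the KV cache after fragmentation.

    Assumes `efficiency` is useful bytes / allocated bytes, so lower
    efficiency inflates the allocation.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return needed_bytes / efficiency
```

At 96% efficiency, roughly 4% of the allocated cache is fragmentation waste; at 100% the allocation equals the ideal size.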

Architecture (override / custom model): Edits here become "manual overrides" that survive switching presets. A "↺ reset" link will appear next to any field you've changed.

Attention type

Total params (billions)

Layers (attention layers)

Q heads

KV heads

Head dim

MLA kv_lora_rank

MLA qk_rope_head_dim

Local window size

Local layers per cycle

Global layers per cycle

Attention layers (full): Of the total layers, how many keep a sequence-growing KV cache.
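The architecture fields above feed the per-token KV cache size. The sketch below uses the commonly cited formulas: keys plus values for MHA/GQA, and a compressed latent plus a RoPE key for MLA. All function names are hypothetical, the calculator's exact formulas may differ, and the local/global split assumes each cycle starts with its local layers.

```python
def gqa_bytes_per_token_per_layer(kv_heads: int, head_dim: int,
                                  bytes_per_elem: float) -> float:
    """MHA/GQA/MQA: one key and one value vector per KV head."""
    return 2 * kv_heads * head_dim * bytes_per_elem


def mla_bytes_per_token_per_layer(kv_lora_rank: int, qk_rope_head_dim: int,
                                  bytes_per_elem: float) -> float:
    """MLA: one compressed KV latent plus one shared RoPE key per token."""
    return (kv_lora_rank + qk_rope_head_dim) * bytes_per_elem


def split_layers(layers: int, local_per_cycle: int, global_per_cycle: int):
    """Return (local_layers, global_layers) for a repeating local/global cycle.

    Assumption: every cycle lists its local layers first, so a trailing
    partial cycle is filled with local layers before global ones.
    """
    cycle = local_per_cycle + global_per_cycle
    full, rem = divmod(layers, cycle)
    local = full * local_per_cycle + min(rem, local_per_cycle)
    return local, layers - local


def sequence_kv_bytes(per_layer_bytes: float, local_layers: int,
                      global_layers: int, window: int, seq_len: int) -> float:
    """KV bytes for one sequence: local layers cap out at the window size."""
    local_tokens = min(seq_len, window)
    return per_layer_bytes * (local_layers * local_tokens + global_layers * seq_len)
```

For example, 4 KV heads with head dim 128 at 2 bytes per element cost 2,048 bytes per token per layer, while an MLA layer with kv_lora_rank 512 and qk_rope_head_dim 64 costs 1,152 bytes at the same precision.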

Memory used

Legend: Weights, KV cache, Activations + framework

Model weights
KV cache (per token)
KV cache (total)
Activations + runtime overhead
Total VRAM required
Available VRAM
Headroom
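The result rows above can be combined roughly as follows. This is a sketch under stated assumptions rather than the calculator's actual code: the function and key names are hypothetical, the fixed framework overhead is counted once rather than per GPU, and available VRAM is simply the per-GPU capacity times the GPU count.

```python
GIB = 2**30  # bytes per GiB


def vram_budget_gib(params_billions: float, weight_bytes_per_param: float,
                    kv_cache_bytes: float, overhead_pct: float,
                    fixed_overhead_gib: float, gpu_vram_gib: float,
                    num_gpus: int = 1) -> dict:
    """Return each memory component, the total, and the headroom in GiB."""
    weights = params_billions * 1e9 * weight_bytes_per_param / GIB
    kv_cache = kv_cache_bytes / GIB
    # Runtime overhead is a percentage of weights, plus a fixed framework cost.
    overhead = weights * overhead_pct / 100 + fixed_overhead_gib
    total = weights + kv_cache + overhead
    available = gpu_vram_gib * num_gpus
    return {
        "weights": weights,
        "kv_cache": kv_cache,
        "overhead": overhead,
        "total": total,
        "available": available,
        "headroom": available - total,  # negative means it does not fit
    }
```

A 7B-parameter model at 2 bytes per parameter needs about 13 GiB for weights alone; the KV cache and overhead terms are added on top before comparing against the available VRAM.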

Show formulas

Hugging Face config audit