Understanding KV Cache: The Hidden Memory Cost of Serving LLMs

Tue, 19 May 2026 17:45:00 +1000

How attention architectures evolved to keep KV cache from eating your GPU, and what that means if you self-host.

Already comfortable with KV cache and attention? Skip the theory and jump straight to the interactive KV Cache Calculator to size VRAM for your model, batch size, and target GPU.

If you’re planning to self-host a large language model, you’ve probably sized VRAM based on parameters alone. A 70B model in BF16 needs roughly 140 GB just for weights. That’s the easy part: 70 billion parameters × 2 bytes.
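The weights arithmetic above can be written as a short sketch. The function name `weight_memory_gb` is made up for illustration, and it assumes decimal gigabytes (1 GB = 1e9 bytes) and counts nothing except the weights themselves.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for model weights alone, in decimal GB (1 GB = 1e9 bytes).

    bytes_per_param defaults to 2.0, the size of one BF16 value.
    """
    return num_params * bytes_per_param / 1e9


# 70 billion parameters at 2 bytes each, as in the paragraph above.
print(f"{weight_memory_gb(70e9):.0f} GB")
```

Swapping `bytes_per_param` for 1.0 or 0.5 gives a rough figure for 8-bit or 4-bit quantized weights, ignoring any per-group scaling metadata.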

KV Cache Calculator

Wed, 20 May 2026 00:00:00 +0000

This calculator estimates how much GPU memory you need to serve a large language model. It accounts for model weights, the KV cache (the part most people under-budget), activation/runtime overhead, and GPU memory currently available on common accelerators.
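As a rough illustration of the KV cache term, here is a minimal sketch of the commonly used estimate for standard multi-head or grouped-query attention: one key and one value vector per layer, per KV head, per token. This is not the calculator's actual code, which is not shown here. The function name and the example architecture values (80 layers, 8 KV heads, head dimension 128) are assumptions chosen for illustration.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # 2 bytes for BF16/FP16 cache entries
) -> int:
    """Estimate KV cache size in bytes for MHA/GQA-style attention.

    The leading factor of 2 covers the separate key and value tensors.
    """
    return (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_elem
    )


# Illustrative (assumed) architecture: 80 layers, 8 KV heads, head_dim 128.
per_request = kv_cache_bytes(80, 8, 128, seq_len=8192)
print(f"{per_request / 1e9:.2f} GB per 8192-token sequence")
```

Because the estimate is linear in both `seq_len` and `batch_size`, serving many long sequences at once can make this term comparable to, or larger than, the weights.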

It’s intended to pair with the KV Cache blog post. All math runs in your browser; nothing is sent to a server. Paste a Hugging Face model id (e.g. Qwen/Qwen2.5-7B-Instruct) to auto-fill the architecture from the model’s config.json (only metadata is fetched, never weights), or pick a curated preset, then tweak any field in Advanced options to override.
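To show how architecture fields might be read from a config.json, here is a hedged sketch that works on an already-loaded dictionary. The key names (`num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `hidden_size`, `head_dim`) are the ones commonly found in Llama-style and Qwen2-style Hugging Face configs; other architectures may name them differently. The helper name and the sample values are assumptions for illustration, not output fetched from any model or the calculator's own logic.

```python
def kv_shape_from_config(config: dict) -> dict:
    """Extract the fields a KV cache estimate needs from a config dict."""
    layers = config["num_hidden_layers"]
    heads = config["num_attention_heads"]
    # Configs without grouped-query attention often omit this key,
    # in which case every attention head has its own K/V head.
    kv_heads = config.get("num_key_value_heads", heads)
    # Some configs state head_dim explicitly; otherwise derive it.
    head_dim = config.get("head_dim") or config["hidden_size"] // heads
    return {"num_layers": layers, "num_kv_heads": kv_heads, "head_dim": head_dim}


# Illustrative sample values (assumed, not fetched from the Hub).
sample_config = {
    "num_hidden_layers": 28,
    "num_attention_heads": 28,
    "num_key_value_heads": 4,
    "hidden_size": 3584,
}
print(kv_shape_from_config(sample_config))
```

Only these few integers are needed for sizing, which is why fetching the metadata file alone is enough and no weights have to be downloaded.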

GenAI on Melchi
