Melchi

Understanding KV Cache: The Hidden Memory Cost of Serving LLMs

Melchi — Tue, 19 May 2026 17:45:00 +1000

How attention architectures evolved to keep KV cache from eating your GPU, and what that means if you self-host.

Already comfortable with KV cache and attention? Skip the theory and jump straight to the interactive KV Cache Calculator to size VRAM for your model, batch size, and target GPU.

If you’re planning to self-host a large language model, you’ve probably sized VRAM based on parameters alone. A 70B model in BF16 needs roughly 140 GB just for weights. That’s the easy part: 70 billion parameters × 2 bytes.
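As a quick check on that arithmetic, here is a minimal sketch in Go. The helper name `weightBytes` is made up for illustration, and the result is in decimal gigabytes (1 GB = 10^9 bytes). It covers weights only, not the KV cache.

```go
package main

import "fmt"

// weightBytes (hypothetical helper) returns the memory needed to hold a
// model's weights: parameter count multiplied by bytes per parameter.
func weightBytes(params, bytesPerParam float64) float64 {
	return params * bytesPerParam
}

func main() {
	const (
		params   = 70e9 // 70 billion parameters
		bf16Size = 2.0  // BF16 stores each parameter in 2 bytes
	)
	gb := weightBytes(params, bf16Size) / 1e9 // decimal gigabytes
	fmt.Printf("weights: %.0f GB\n", gb)      // prints "weights: 140 GB"
}
```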

Rate limiting in Golang HTTP client

Melchi — Sun, 01 Dec 2019 21:57:40 +0800

I’ve been doing some interesting work with the team at MFlow writing HTTP clients that consume financial data, and it’s been eye-opening to see how different API platforms choose to protect their resources. Best practices for client-side rate limiting are scarce compared with their server-side equivalents, so here are my thoughts on the subject, along with some code samples.

TL;DR — Wrap *http.Client and call limiter.Wait(ctx) before every request, where limiter is a *rate.Limiter from golang.org/x/time/rate. The token bucket honours bursts, blocks cleanly when you’re out of tokens, and respects context cancellation.

Cloud cost management

Melchi — Mon, 27 Jul 2020 11:14:32 +1000

TL;DR — Compute is usually the largest line item on your cloud bill. Bills tell you what you spent, not what you used. Measure utilisation with percentiles (P95/P99), not averages. Prefer always-on elastic infrastructure over scheduled shutdowns, and let Kubernetes bin-pack workloads to squeeze more value out of every node.
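To illustrate why percentiles tell you more than averages, here is a small sketch using made-up CPU utilisation samples and the nearest-rank percentile definition (one of several common definitions). The helper names `mean` and `percentile` are assumptions for this example.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// mean returns the arithmetic average of samples (samples must be non-empty).
func mean(samples []float64) float64 {
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of
// samples, which must be non-empty. The input slice is not modified.
func percentile(samples []float64, p float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical CPU utilisation samples (%): mostly idle with two short spikes.
	cpu := []float64{10, 12, 11, 9, 10, 13, 11, 10, 12, 95, 10, 11, 9, 12, 10, 11, 90, 10, 12, 11}

	fmt.Printf("mean: %.1f%%\n", mean(cpu))
	fmt.Printf("P95:  %.0f%%\n", percentile(cpu, 95))
	fmt.Printf("P99:  %.0f%%\n", percentile(cpu, 99))
}
```

On data like this, the mean stays low while P95 and P99 land on the spikes, which is the headroom you actually need to plan for.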

Back in 2015, public cloud services were not well understood. Large enterprises debated whether migrating to the cloud would meet their security requirements, paralysed by fear of the unknown. We have come a long way since — digital transformation is now synonymous with cloud migration. The benefits of on-demand infrastructure and elasticity have made engineers more productive and businesses happier with the promise of improved time-to-market.

Securing your CaaS using Google's gVisor

Melchi — Tue, 29 May 2018 10:49:15 +1000

TL;DR — A standard Linux container is an isolation boundary, not a security boundary. Every container on a host shares one kernel, so a single kernel exploit can compromise the whole node. gVisor inserts a user-space kernel (runsc) between your container and the host, dramatically shrinking the attack surface. It’s now production-grade — Google runs Cloud Run, App Engine and Cloud Functions on it — and integrates cleanly with containerd and Kubernetes via RuntimeClass.