<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>Melchi</title><link>https://melchi.me/</link><description>Melchi</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Wed, 20 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://melchi.me/index.xml" rel="self" type="application/rss+xml"/><item><title>Understanding KV Cache: The Hidden Memory Cost of Serving LLMs</title><link>https://melchi.me/posts/kv-cache/</link><pubDate>Tue, 19 May 2026 17:45:00 +1000</pubDate><author>Melchi</author><guid>https://melchi.me/posts/kv-cache/</guid><description><![CDATA[<p><em>How attention architectures evolved to keep KV cache from eating your GPU, and what that means if you self-host.</em></p>
<blockquote>
<p><strong>Already comfortable with KV cache and attention?</strong> Skip the theory and jump straight to the <a href="/tools/kv-cache-calculator/" rel="">interactive <strong>KV Cache Calculator</strong></a>
 to size VRAM for your model, batch size, and target GPU.</p>
</blockquote>
<p>If you&rsquo;re planning to self-host a large language model, you&rsquo;ve probably sized VRAM based on parameters alone. A 70B model in BF16 needs roughly <strong>140 GB</strong> just for weights. That&rsquo;s the easy part: 70 billion parameters × 2 bytes.</p>]]></description></item><item><title>Rate limiting in Golang HTTP client</title><link>https://melchi.me/posts/golang/</link><pubDate>Sun, 01 Dec 2019 21:57:40 +0800</pubDate><author>Melchi</author><guid>https://melchi.me/posts/golang/</guid><description><![CDATA[<p>I&rsquo;ve been doing some interesting work with the team at MFlow writing HTTP clients that consume financial data, and it&rsquo;s been eye-opening to see how different API platforms choose to protect their resources. Best practices for <em>client-side</em> rate limiting seem scarce compared with server-side guidance, so here are my thoughts on the subject and some code samples.</p>
<blockquote>
<p><strong>TL;DR</strong> — wrap <code>*http.Client</code> and call <code>limiter.Wait(ctx)</code> before every request, where <code>limiter</code> is a <code>*rate.Limiter</code> from <a href="https://pkg.go.dev/golang.org/x/time/rate" target="_blank" rel="noopener noreffer"><code>golang.org/x/time/rate</code></a>. The token bucket honours bursts, blocks cleanly when you&rsquo;re out of tokens, and respects context cancellation.</p>]]></description></item><item><title>Cloud cost management</title><link>https://melchi.me/posts/cloud/</link><pubDate>Mon, 27 Jul 2020 11:14:32 +1000</pubDate><author>Melchi</author><guid>https://melchi.me/posts/cloud/</guid><description><![CDATA[<blockquote>
<p><strong>TL;DR</strong> — Compute is usually the largest line item on your cloud bill. Bills tell you what you spent, not what you used. Measure utilisation with percentiles (P95/P99), not averages. Prefer always-on elastic infrastructure over scheduled shutdowns, and let Kubernetes bin-pack workloads to squeeze more value out of every node.</p>
</blockquote>
<p>Back in 2015, public cloud services were not well understood. Large enterprises debated whether migrating to the cloud would meet their security requirements, paralysed by fear of the unknown. We have come a long way since — digital transformation is now synonymous with cloud migration. The benefits of on-demand infrastructure and elasticity have made engineers more productive and businesses happier with the promise of improved time-to-market.</p>]]></description></item><item><title>Securing your CaaS using Google's gVisor</title><link>https://melchi.me/posts/containers/</link><pubDate>Tue, 29 May 2018 10:49:15 +1000</pubDate><author>Melchi</author><guid>https://melchi.me/posts/containers/</guid><description><![CDATA[<blockquote>
<p><strong>TL;DR</strong> — A standard Linux container is an isolation boundary, not a security boundary. Every container on a host shares one kernel, so a single kernel exploit can compromise the whole node. gVisor inserts a user-space kernel (<code>runsc</code>) between your container and the host, dramatically shrinking the attack surface. It&rsquo;s now production-grade — Google runs Cloud Run, App Engine and Cloud Functions on it — and integrates cleanly with <code>containerd</code> and Kubernetes via <code>RuntimeClass</code>.</p>]]></description></item></channel></rss>