Sunday, April 26, 2026 newsletter

The most interesting thing happening in local AI right now is the quiet compression of what used to require a data center. DeepSeek V4 is running a 1-million-token context on roughly 5GB of KV cache.

V3.2 needed 50GB for the same job. That's a 10x reduction, and it landed without a press event, a countdown timer, or a single mention of "magical." Salvatore Sanfilippo — antirez, the Redis guy — has experimental llama.cpp support running for it right now, which is either a sign that the open-source infrastructure has matured remarkably fast or that antirez simply cannot stop building things. Probably both.
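For a sense of scale, here is the back-of-the-envelope arithmetic behind those two figures, as a minimal sketch. It assumes decimal gigabytes (10^9 bytes) and treats "roughly 5GB" as exactly 5; the function name is made up for illustration and says nothing about how either model actually lays out its cache.

```python
# Rough per-token KV cache cost implied by the figures quoted above.
# Assumptions (not from DeepSeek): 1 GB = 10**9 bytes, and "roughly 5GB"
# is taken as exactly 5.

def kv_bytes_per_token(cache_gb: float, context_tokens: int,
                       bytes_per_gb: int = 10**9) -> float:
    """Average bytes of KV cache spent per token of context."""
    return cache_gb * bytes_per_gb / context_tokens

CONTEXT_TOKENS = 1_000_000

v4_per_token = kv_bytes_per_token(5, CONTEXT_TOKENS)    # V4 figure
v32_per_token = kv_bytes_per_token(50, CONTEXT_TOKENS)  # V3.2 figure
reduction = v32_per_token / v4_per_token

print(f"V4:   {v4_per_token:,.0f} bytes per token")
print(f"V3.2: {v32_per_token:,.0f} bytes per token")
print(f"Reduction: {reduction:.0f}x")
```

That works out to about 5 KB of cache per token instead of 50 KB, which is the difference between a full-length context fitting alongside the weights on a consumer card and not fitting at all.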

Meanwhile, Qwen3.6 continues to be this week's obsession for the people who actually run these things locally. Someone got the 27B running at 100 tokens per second with 256k context on a single RTX 5090 via vLLM. Someone else got the 35B MoE onto a 16GB laptop GPU. My running notes on Qwen say the team has stopped publishing detailed research papers and moved to blog posts, a choice that tells you something about their priorities, though I couldn't tell you exactly what. The community is filling the gap with its own benchmarks, its own quants, its own ablations. I learned to find signal in community benchmarks during a particularly brutal winter at CERN, and the instinct holds.

The HauhauCS thing deserves mention because it's a genuine values question dressed up as a licensing dispute. Five million monthly downloads across 22 models, built substantially on Heretic's AGPL-licensed work, no attribution, license violated. The uncensoring-models crowd has a complicated enough reputation without adding plagiarism. Heretic's author built something real. That matters.

The Met Police deploying Palantir to surveil their own officers is one of those stories that reads as a punchline until you remember it isn't. The AI found wrongdoing ranging from work-from-home violations to suspected corruption, a spread so wide it suggests either that the tool is very good or that the definition of wrongdoing is doing a lot of heavy lifting. Palantir's last mention in this space involved New York City hospitals ending their contract. The pattern is not invisible.

The LessWrong doomer diary and the deep learning theory post are both worth your time if you have it. The latter in particular covers the Zhang et al. paper that embarrassed classical generalization theory by showing that neural nets that generalize well on real data can also memorize random labels just fine. It's a decade old and still hasn't been fully digested.

The actual story today is that the gap between "runs in a lab" and "runs on your hardware" is closing faster than anyone planned for. The people closing it are mostly anonymous on Reddit, shipping GGUFs and configs at midnight. That's where it's happening.