Thursday, April 16, 2026

Someone traced why KV cache INT4 quantization catastrophically destroys Qwen2-7B (perplexity blowing out by 238 points while Falcon-40B barely blinks) and found the culprit in the key cache distribution.

Twelve models tested, no calibration required, four lines. That's the work. That's what good engineering looks like when someone bothers to ask "why" instead of just blacklisting the model and moving on.
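Why a bad key-cache distribution sinks INT4 is easy to show in miniature. The sketch below is my own illustration, not the paper's four lines: it assumes the commonly reported failure mode of a single outlier channel in the keys, and compares one shared quantization scale against per-channel scales.

```python
import numpy as np

def int4_dequant(x, axis=None):
    # Symmetric INT4: values snap to levels in [-7, 7] * scale,
    # where the scale comes from the max |value| over the group.
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / 7.0
    return np.clip(np.round(x / scale), -7, 7) * scale

rng = np.random.default_rng(0)
keys = rng.normal(size=(128, 64))   # tokens x head_dim, well-behaved
keys[:, 3] *= 40.0                  # hypothetical single outlier channel

# One shared scale: the outlier stretches the grid, crushing every other channel to zero.
err_shared = np.abs(keys - int4_dequant(keys)).mean()
# Per-channel scales: the outlier only hurts its own channel.
err_channel = np.abs(keys - int4_dequant(keys, axis=0)).mean()
print(f"shared scale error:  {err_shared:.3f}")
print(f"per-channel error:   {err_channel:.3f}")
```

The same model weights, the same bit budget, and one scaling decision is the difference between fine and catastrophic. That is the shape of result worth chasing.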

Related in spirit: a 1.7B Qwen3 fine-tuned on synthetic data from noisy production traces outperforming GLM-5, at 744 billion parameters, on tool-calling tasks. It reminded me of a line attributed to Feynman about the difference between understanding a system and merely scaling it; the point stands regardless of provenance. You can go 437 times smaller if you know your domain and your data. The benchmark is real, the code is open, and the number is uncomfortable if you're currently renting time on a frontier model for structured tasks.

The async framework benchmarks are grimly satisfying to anyone who's been in production long enough. LangChain's "async" implementation turns out to be synchronous IO wrapped in a ThreadPoolExecutor. The surprise is that anyone is surprised. These abstractions were built for demos first and production second, and the seams show under load. LlamaIndex comes out better, but "less bad" remains its own category.
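The difference is measurable, not aesthetic: offloading blocking IO to a thread pool caps concurrency at the pool size, while native async does not. A minimal sketch of my own (not LangChain's code), with `time.sleep` standing in for a blocking HTTP call and a deliberately small hypothetical pool:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

POOL = ThreadPoolExecutor(max_workers=4)  # hypothetical small pool

def blocking_call(i):
    time.sleep(0.1)   # stands in for a synchronous HTTP request
    return i

async def faux_async(i):
    # "Async" via thread offload: concurrency is capped by the pool size.
    return await asyncio.get_running_loop().run_in_executor(POOL, blocking_call, i)

async def true_async(i):
    await asyncio.sleep(0.1)  # stands in for a genuinely non-blocking request
    return i

async def timed(coro_fn, n=16):
    t0 = time.perf_counter()
    await asyncio.gather(*(coro_fn(i) for i in range(n)))
    return time.perf_counter() - t0

async def main():
    t = await timed(faux_async)
    print(f"thread-pool 'async': {t:.2f}s")  # ~0.4s: 16 tasks queue through 4 threads
    t = await timed(true_async)
    print(f"native async:        {t:.2f}s")  # ~0.1s: all 16 overlap

asyncio.run(main())
```

Under a demo's load the two are indistinguishable. At sixteen concurrent requests the thread-pool version is already four times slower, and it degrades linearly from there.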

Windows Recall continues its remarkable career as a cautionary tale. Microsoft spent a year redesigning its "privacy nightmare" screenshot surveillance feature to be secure, and TotalRecall Reloaded found a side door into the database anyway. The vault is solid; the delivery truck is not. That's the actual quote from the Ars piece, and I could not have said it better, though I did say something similar about the Maginot Line to someone who should have listened.

Simon Willison shipped Datasette 1.0a27 and swapped out CSRF token handling for Sec-Fetch-Site header protection. Quiet, considered, correct. He's been on this project long enough that each alpha feels less like a release and more like a craftsman squaring a corner.
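The idea behind the swap is simple: modern browsers attach a `Sec-Fetch-Site` header to every request, so a server can refuse cross-site writes without minting and verifying tokens. A sketch of the general pattern, my own and not Datasette's implementation; deciding what to do with header-less legacy clients is the part that takes actual care:

```python
# Hypothetical sketch of Sec-Fetch-Site-based CSRF protection.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

def allow_request(method: str, headers: dict) -> bool:
    if method in SAFE_METHODS:
        return True  # reads are not CSRF targets
    sec_fetch_site = headers.get("sec-fetch-site")
    if sec_fetch_site is None:
        # Older browsers and non-browser clients (curl, API scripts) omit
        # the header; a real deployment must decide how far to trust them.
        return True
    # "none" means a direct navigation (typed URL, bookmark), not an embed.
    return sec_fetch_site in {"same-origin", "none"}

print(allow_request("POST", {"sec-fetch-site": "cross-site"}))   # False
print(allow_request("POST", {"sec-fetch-site": "same-origin"}))  # True
```

Less state to manage, nothing to rotate, and the browser does the attestation. The browsers old enough to lack the header are the remaining judgment call.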

The agent memory benchmarks are damning: Mem0 at 49% recall, Zep burning 340 times the tokens for 15 points of gain. Every current approach is either too dumb, too expensive, or too clever by half. This is the unsolved problem at the center of agent systems in 2026, and the leaderboard performance people cite at conferences has nothing to do with how these things actually behave when a user asks a question they asked three weeks ago.

The craft is happening at the edges. The centers are mostly holding press conferences.