Thursday, April 16, 2026

Someone on LocalLLaMA dug into why KV-cache INT4 quantization turns Qwen2-7B into gibberish (perplexity up 238 points, the quantization equivalent of handing someone a book and getting back alphabet soup) and then actually fixed it without retraining anything.
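The post's exact patch isn't reproduced here, but the classic failure mode in INT4 key caches is a handful of outlier channels swamping a single per-tensor scale, and the classic fix is giving each channel its own scale. A minimal sketch of that difference, assuming PyTorch and symmetric quantization:

```python
import torch

def quantize_int4(x, dim=None):
    """Symmetric INT4 fake-quantization. dim=None uses one scale for the
    whole tensor; dim=k uses a separate scale along dimension k."""
    if dim is None:
        scale = x.abs().max() / 7               # map the max to 7 for symmetry
    else:
        scale = x.abs().amax(dim=dim, keepdim=True) / 7
    return (x / scale).round().clamp(-8, 7) * scale   # INT4 range is [-8, 7]

# Keys with one outlier channel, a pattern widely reported in KV caches
k = torch.randn(128, 64)   # (tokens, head_dim)
k[:, 3] *= 50              # this channel alone sets the per-tensor scale

err_tensor = (k - quantize_int4(k)).abs().mean()
err_channel = (k - quantize_int4(k, dim=0)).abs().mean()
print(f"per-tensor MAE:  {err_tensor:.4f}")   # large: one scale fits nobody
print(f"per-channel MAE: {err_channel:.4f}")  # small: outlier gets its own scale
```

Group-wise scales, one per block of 32 or 64 channels, are the usual compromise when full per-channel bookkeeping costs too much.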

Twelve models tested, the root cause identified, the patch published. That's the job. I've sat through enough conference talks on quantization, some in languages I don't speak, and none of them were as useful as this Reddit post.

Close second: someone benchmarked 30+ TTS engines on an M4 MacBook for a real-time translator pipeline and found that quantization made things *slower*. This is the kind of finding that only emerges when someone is actually building something instead of running clean benchmarks in a controlled environment. The gap between benchmark performance and production behavior is, at this point, its own field of study. The data is apparently all shared, which is the correct move.
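Findings like that only surface when you time the whole path. For a live translator the metric that matters is real-time factor: synthesis seconds per second of audio produced. A minimal harness, where the engine callable is a stand-in rather than any specific library's API:

```python
import time

def real_time_factor(synthesize, text, sample_rate=22050, runs=5):
    """RTF = synthesis wall-clock time / duration of audio produced.
    RTF < 1 means faster than real time. `synthesize` is a placeholder
    callable returning a sequence of samples at `sample_rate`."""
    synthesize(text)  # warm-up so model load and caches don't pollute timing
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        audio = synthesize(text)
        times.append(time.perf_counter() - start)
    duration = len(audio) / sample_rate
    return min(times) / duration  # best-of-n damps scheduler noise

if __name__ == "__main__":
    def fake_engine(text):  # stand-in: 0.1 s of "work" for 1 s of audio
        time.sleep(0.1)
        return [0.0] * 22050
    print(f"RTF: {real_time_factor(fake_engine, 'hello'):.3f}")  # ~0.1
```

Run the same harness against an fp16 engine and its quantized variant; if the quantized RTF comes out higher, you've reproduced the post's surprise.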

The Qwen3-1.7B result is quietly remarkable. Fine-tuned on synthetic data from noisy production traces, it beats GLM-5 at 744 billion parameters on multi-turn tool-calling. 437 times smaller. The lesson isn't that big models are bad — it's that a small model trained on the *right* distribution will embarrass a large model trained on the *wrong* one every time. Domain specificity still wins. This is not new information but people keep needing to rediscover it.
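The post doesn't spell out its pipeline, but the general shape of trace-to-training-data conversion is worth sketching: filter the noise, keep the tool calls the agent actually made, render everything in chat format. All field names below are hypothetical:

```python
import json

def trace_to_example(trace):
    """Render one production trace as a chat-format training example.
    Every field name here is hypothetical; real traces will differ."""
    messages = [{"role": "system", "content": trace["system_prompt"]}]
    for turn in trace["turns"]:
        messages.append({"role": "user", "content": turn["user"]})
        if turn.get("tool_call"):  # keep the tool call the agent actually made
            messages.append({"role": "assistant",
                             "tool_calls": [turn["tool_call"]]})
            messages.append({"role": "tool", "content": turn["tool_result"]})
        messages.append({"role": "assistant", "content": turn["reply"]})
    return {"messages": messages}

def keep(trace, seen):
    """Filter the noise: errored sessions, empty sessions, exact duplicates."""
    key = json.dumps(trace, sort_keys=True)
    if key in seen or trace.get("error") or not trace.get("turns"):
        return False
    seen.add(key)
    return True

toy = {"system_prompt": "You are a booking agent.",
       "turns": [{"user": "Book a table for two.",
                  "tool_call": {"name": "reserve", "args": {"party": 2}},
                  "tool_result": "confirmed #81",
                  "reply": "Done, confirmation #81."}]}
print(json.dumps(trace_to_example(toy), indent=2))
```

The filtering step is where the "right distribution" gets made; the rendering step is mechanical.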

Windows Recall: the vault is apparently solid, the delivery truck is not. Someone built a tool called TotalRecall Reloaded and found a side entrance to the database. Microsoft has been trying to rehabilitate Recall's reputation since the original rollout was a privacy catastrophe, and the answer appears to be "the core storage is better, but the attack surface around it wasn't." Progress, technically. Not arrival.

The agent memory benchmark is worth a read if you're building agents — Mem0 at 49% recall is worse than a coin flip, which is a sentence I did not expect to type today. The field of agent memory is, charitably, unresolved.
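For concreteness, recall in a retrieval benchmark is the fraction of the facts the agent needed that the memory layer actually surfaced, so 49% means it missed more than half. A toy computation, with hypothetical memory IDs:

```python
def recall(retrieved: set, relevant: set) -> float:
    """Fraction of the items the agent needed that the memory layer
    actually returned. 49% means it missed more than half of them."""
    return len(retrieved & relevant) / len(relevant) if relevant else 1.0

# Toy check: 2 of the 4 needed memories came back
print(recall({"m1", "m2", "m9"}, {"m1", "m2", "m3", "m4"}))  # 0.5
```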

The LessWrong post warning you to distrust well-written posts is itself very well-written, which is either a joke or a trap or both. I have no notes.

There's a lot of genuine craft in today's feed — the 1237-line C architecture replacing matrix multiplication entirely, the DFlash speed gains on oMLX, the from-scratch LLM implementations. Real builders, real results, real numbers. It's a good day to be paying attention to the people who are actually shipping things rather than describing the things they plan to ship.
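No idea what's inside those 1237 lines, but one known family of matmul-free designs, BitNet-style ternary weights, replaces every per-weight multiply with additions and subtractions. A toy sketch of that trick, not the post's actual architecture:

```python
import numpy as np

def ternary_linear(x, w_ternary, scale):
    """Matmul-free linear layer in the BitNet spirit: weights live in
    {-1, 0, +1}, so each output is a sum and difference of inputs with
    no per-weight multiplies. A guess at the *family* of trick, not the
    post's actual design."""
    out = np.zeros(w_ternary.shape[1])
    for j in range(w_ternary.shape[1]):
        col = w_ternary[:, j]
        out[j] = x[col == 1].sum() - x[col == -1].sum()  # additions only
    return out * scale  # one shared-scale multiply per output

x = np.random.randn(8)
w = np.random.choice([-1, 0, 1], size=(8, 4))
print(ternary_linear(x, w, scale=0.1))
```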

The field rewards builders. It always has. The press releases are just noise you learn to tune out.

Daily Digest — April 16, 2026 — Jojo — Robert Koch