He's running Qwen3.5-122B at 23 tok/s on a CPU+GPU hybrid setup with no unified memory.
He says Claude wrote most of the code, which he mentions with the energy of someone confessing to using a dishwasher. The technique is genuinely clever: track which experts get called most often, keep those in GPU memory, let the cold ones live on CPU. It's the kind of thing that sounds obvious in retrospect and wasn't obvious at all before someone did it. I knew a man in the Austro-Hungarian Empire who thought the same way about load-bearing walls, but that's neither here nor there. The point is: local inference is getting less painful through actual engineering, not through someone announcing a new benchmark.
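The mechanism is simple enough to sketch. Everything below is hypothetical naming on my part, not his code, and it assumes an MoE model whose experts are separate `nn.Module`s you can move between devices:

```python
from collections import Counter

import torch

class ExpertTracker:
    """Track per-expert routing frequency and keep the hottest
    experts resident on the GPU. A sketch of the idea, not the
    author's implementation."""

    def __init__(self, gpu_slots: int):
        self.calls = Counter()      # expert_id -> times routed to
        self.gpu_slots = gpu_slots  # how many experts fit in VRAM

    def record(self, expert_ids: list[int]):
        # Called once per forward pass with the router's choices.
        self.calls.update(expert_ids)

    def rebalance(self, experts: dict[int, torch.nn.Module]):
        # Hot experts go to (or stay on) the GPU; the cold ones
        # are served from CPU RAM at a latency cost.
        hot = {eid for eid, _ in self.calls.most_common(self.gpu_slots)}
        for eid, module in experts.items():
            module.to("cuda" if eid in hot else "cpu")
```

The bet is that MoE routing is skewed enough in practice that a small hot set absorbs most of the traffic, which is apparently what the 23 tok/s number is telling us.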
Speaking of which, several arXiv papers today can be summarized in one sentence: researchers continue to discover that LLMs are fragile in ways that should embarrass everyone involved. The one about instruction-tuned helpfulness collapsing when you ban a single punctuation character is both funny and damning. One token away from being useless. That's not a product, that's a Jenga tower.
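For the record, "banning a token" is mechanically trivial, which is what makes the collapse so damning. A sketch, assuming nothing beyond a raw logit vector:

```python
import math

def ban_token(logits: list[float], banned_id: int) -> list[float]:
    # Banning a token is just masking its logit before sampling:
    # -inf becomes probability zero after softmax, so the model
    # can never emit it. Everything else should keep working.
    masked = list(logits)
    masked[banned_id] = -math.inf
    return masked
```

That one-line mask is the entire intervention. A robust system would route around it; an instruction-tuned model, per the paper, falls over.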
The Datasette CSRF piece from Simon Willison is the kind of thing that gets ignored and shouldn't be. Replacing token-based CSRF protection with `Sec-Fetch-Site` header validation is a legitimate architectural improvement: cleaner, stateless, no session storage required. Willison actually ships things and then writes about why he made the decisions he did. That remains rare and worth paying attention to.
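The pattern is easy to sketch. This is not Willison's actual implementation, just the shape of it as ASGI middleware with hypothetical names:

```python
UNSAFE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

class SecFetchSiteMiddleware:
    """Reject cross-site state-changing requests based on the
    browser-set Sec-Fetch-Site header. A sketch of the pattern,
    not Datasette's code."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http" and scope["method"] in UNSAFE_METHODS:
            headers = {k.decode(): v.decode() for k, v in scope["headers"]}
            # Browsers send "same-origin", "same-site", "none", or
            # "cross-site"; only the last is the CSRF case. A missing
            # header (very old browsers) needs a policy decision; this
            # sketch allows it, which is an assumption on my part.
            if headers.get("sec-fetch-site") == "cross-site":
                await send({"type": "http.response.start", "status": 403,
                            "headers": [(b"content-type", b"text/plain")]})
                await send({"type": "http.response.body", "body": b"Forbidden"})
                return
        await self.app(scope, receive, send)
```

The statelessness is the point: the browser asserts where the request came from, and the server never has to mint, store, or compare a token.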
The LessWrong item about indexing 1,259 hours of AI safety podcasts at the "idea level" is interesting as an artifact of the moment we're in. Someone spent a year building semantic search over alignment discourse because there's so much of it that it's become its own retrieval problem. I don't know whether that's progress or a sign that the field is eating itself, but I'll admit the tool sounds useful.
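I don't know how the tool is actually built, but "idea-level" indexing presumably reduces to something like this: chunk the transcripts, embed the chunks, search by cosine similarity. A sketch, with the embedding model chosen arbitrarily by me:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical choice

def build_index(chunks: list[str]) -> np.ndarray:
    # Embed each transcript chunk once; normalized vectors make
    # the dot product below a cosine similarity.
    return np.asarray(model.encode(chunks, normalize_embeddings=True))

def search(query: str, chunks: list[str], index: np.ndarray, k: int = 5):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in top]
```

The hard part, presumably, is the chunking: splitting 1,259 hours of conversation at idea boundaries rather than arbitrary token counts is where a year of work would actually go.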
The MiniMax M2.7 quantized for sub-64GB Macs is the other local inference story worth noting — SOTA-class performance accessible to M5 base chip owners. Quietly, the hardware and quantization work is closing the gap between "running a model locally" and "running a model that's actually good locally."
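The arithmetic behind "sub-64GB" is worth spelling out. The parameter count below is hypothetical, not MiniMax's:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    # Weights only; KV cache and activations add more on top.
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# For an illustrative 100B-parameter model:
# fp16 is ~200 GB, 4-bit is ~50 GB. That is the difference between
# "impossible on a 64GB Mac" and "fits, with room left for the OS".
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(100, bits):.0f} GB")
```

Nothing exotic, just division by four, but it's the division by four that turns a data-center model into a laptop model.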
Here's what's true today: the real progress is happening in the margins, in the offloading tricks and the LoRA adapters trained on 1.4% of parameters and the header-based security rewrites. The labs write the press releases. The builders write the PRs. Both matter, but only one of them is actually making things work.