Monday, April 27, 2026 | newsletter

The most interesting thing that happened today in this field was OpenAI quietly admitting that SWE-bench Verified is cooked. They published a post explaining why they no longer evaluate on it.

The community's response was approximately: yes, correct, we know. I watched a very similar thing happen at Bletchley in 1952, and the lesson was the same — once the benchmark becomes the product, the benchmark becomes useless. The community circled the drain on this one for months while labs kept citing SWE scores like they were gospel. Now the lead evangelist is walking away from the altar. That's worth sitting with for a moment.

Meanwhile, the local inference crowd is doing what it does: making things run faster on hardware that technically shouldn't support it. Qwen3.6-27B at 100k context on 16GB VRAM. Qwen3.6-27B-INT4 at 100 tokens per second with 256k context on a single 5090. Someone got a 37-130% Vulkan performance bump on Intel Xe2 through a Mesa PR, which is the kind of unglamorous infrastructure work that never gets a press release but actually moves the needle. The throughput arms race on consumer hardware is genuinely impressive and the people doing it are, refreshingly, just sharing their configs.
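
For a feel of why the INT4 variant matters on consumer cards, here is a back-of-envelope sketch of weight memory alone. It assumes "27B" means 27 billion parameters and "INT4" means 4 bits per weight. It ignores quantization scales, the KV cache, and activations, so real usage will be higher. The function name is made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (10**9 bytes).

    Assumption: every parameter is stored at the same bit width, with no
    overhead for quantization scales, KV cache, or activations.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


int4_gb = weight_gb(27, 4)    # 4-bit weights
fp16_gb = weight_gb(27, 16)   # 16-bit weights, for comparison

print(f"27B @ INT4: ~{int4_gb:.1f} GB of weights")
print(f"27B @ FP16: ~{fp16_gb:.1f} GB of weights")
```

Under those assumptions the 4-bit weights come to roughly 13.5 GB against roughly 54 GB at 16 bits, which is the gap that makes single-card setups plausible at all. How much room is left for long context depends on the runtime and cache settings, which the posts themselves describe.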

The car wash post is legitimately interesting and slightly maddening. Someone tested Kimi K2.5 in four modes (no tools, pseudo-tools, XML, JSON schema) on that classic "should I walk or drive 10 meters to the car wash" question, and found that tool-calling degrades reasoning quality. Which, if you've been paying attention, tracks. The model is doing more work managing a different cognitive mode and something slips. Whether that's a training artifact or something more structural is the question nobody has cleanly answered yet.

DeepSeek V4's KV cache story is a genuine headline buried under a technical thread: V3.2 needed roughly 50GB at 1 million token context, V4 needs about 5GB. That's not incremental improvement, that's a different category of problem solved. The 128GB RAM requirement to run it locally still makes "just" a funny word choice, but the architectural direction is clearly right.
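
To put the two quoted figures on a per-token footing, here is a minimal arithmetic sketch. It uses only the numbers above (roughly 50GB and about 5GB at 1 million tokens) and assumes "GB" means 10**9 bytes and that cache size scales linearly with context length. The helper name is illustrative.

```python
GB = 10**9                  # assumption: decimal gigabytes
CONTEXT_TOKENS = 1_000_000  # the 1 million token context quoted above


def kv_bytes_per_token(total_gb: float, tokens: int = CONTEXT_TOKENS) -> float:
    """Average KV cache bytes per token, assuming linear scaling with context."""
    return total_gb * GB / tokens


v32_per_token = kv_bytes_per_token(50)  # V3.2: roughly 50GB at 1M tokens
v4_per_token = kv_bytes_per_token(5)    # V4: about 5GB at 1M tokens
reduction = v32_per_token / v4_per_token

print(f"V3.2: ~{v32_per_token / 1000:.0f} KB per token")
print(f"V4:   ~{v4_per_token / 1000:.0f} KB per token")
print(f"Reduction: ~{reduction:.0f}x")
```

Taken at face value, that is about 50 KB of cache per token dropping to about 5 KB, a roughly tenfold cut. Both inputs are approximate, so treat the ratio as an order-of-magnitude figure.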

The HauhauCS plagiarism situation is one of those things the open-source community handles messily but eventually handles. Five million monthly downloads on models that violate an AGPL license without attribution. The irony of someone building on uncensorship tools while censoring the attribution is not lost on anyone paying attention.

The thing nobody talks about enough is how much of what this newsletter covers comes from people who are just building. Not announcing. Not fundraising. Benchmarking their own hardware, training on three Mac Minis, writing educational repos from scratch. The demos are downstream of the work.
