Monday, March 23, 2026

There's a paper out on arXiv that deserves more attention than it's getting. Researchers did actual mechanistic work on how political censorship is implemented inside Qwen, DeepSeek, GLM, Yi, and others: not vibes-based "we asked it about Tiananmen and it refused," but ablations, internal activations, the actual circuitry of the refusal. The finding that tracks across Qwen generations is particularly interesting: as the models get newer, the censorship appears to get more deeply embedded, not less. That's not an accident. That's a design trajectory. Worth reading the full paper if you're making any decisions about deploying these models in contexts where that matters, which is most contexts.
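
For context on what "ablations on internal activations" buys you over prompt poking: one standard move in the interpretability literature is difference-of-means directional ablation. Collect hidden states on prompts the model refuses and prompts it answers, take the difference of the means, and project that direction out at inference; if refusals collapse, the behavior is mediated by that direction. A minimal numpy sketch of the idea, illustrative only and not the paper's code:

```python
import numpy as np

def refusal_direction(acts_refused: np.ndarray, acts_complied: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between residual-stream activations
    gathered on refused vs. answered prompts. Shapes: (n_prompts, d_model)."""
    d = acts_refused.mean(axis=0) - acts_complied.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(h: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Project d out of a single hidden state h, shape (d_model,):
    h' = h - (h . d) d. Applied at each layer during generation, this
    tests whether the refusal behavior rides on that one direction."""
    return h - (h @ d) * d
```

A plausible signature of "more deeply embedded" in this framing: the intervention stops working cleanly, with the behavior smeared across more directions and layers instead of one axis you can project away.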

Right behind it, and honestly just as important for anyone who benchmarks anything: someone audited LoCoMo, a long-context memory benchmark that people are still actively submitting scores to as of this year. 6.4% of the answer key is wrong. The LLM judge accepts up to 63% of intentionally wrong answers. I've seen crooked carnival games with better calibration. And yet the leaderboard rolls on, producing numbers that people cite in blog posts and press releases as if they mean something. They don't. This is what benchmark theater looks like from the inside.
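
The judge number is the more damning one, and it's the kind of audit anyone can run: pair gold answers with deliberately wrong ones and count how often the judge waves the wrong ones through. A sketch of the measurement, with a toy lenient judge standing in for the real LLM judge; none of this is LoCoMo's actual harness:

```python
def false_acceptance_rate(judge, items):
    """Fraction of deliberately wrong answers the judge accepts.
    judge(question, gold, candidate) -> True if graded correct.
    items is (question, gold, wrong) triples. Calibrated judges
    should score near zero here."""
    accepted = sum(judge(q, gold, wrong) for q, gold, wrong in items)
    return accepted / len(items)

# Toy stand-in: accepts any candidate sharing a word with the gold
# answer -- exactly the kind of leniency that inflates leaderboards.
def sloppy_judge(question, gold, candidate):
    return bool(set(gold.lower().split()) & set(candidate.lower().split()))

items = [
    ("Where did Maya move?", "She moved to Boston in May", "She moved to Denver in May"),
    ("What pet does Sam own?", "A grey cat named Iris", "A grey parrot named Iris"),
]
print(false_acceptance_rate(sloppy_judge, items))  # 1.0 -- both wrong answers pass
```

An honest judge scores near zero on this test. At 63%, the grader, not the models, is setting the leaderboard.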

The item that made me actually sit up: a 7MB binary-weight Mamba model. 57 million parameters, all weights binarized to {-1, +1}, zero floating-point at inference, runs on an ESP32 or in a browser; 57 million weights at one bit each is your 7MB. No math.h in the C runtime. Just XNOR and popcount. I learned a long time ago, working alongside some people who had no business being as smart as they were, that the most constrained environments produce the most interesting engineering. This is that. It's not going to write your quarterly report. It might run on your thermostat, your sensor array, your cheap microcontroller with 8MB of RAM. That's a different and genuinely useful frontier.
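
The trick that makes "no math.h" possible: pack the {-1, +1} values one per bit, and a dot product reduces to bitwise ops. Matching bits contribute +1 and differing bits -1, so dot(a, b) = n - 2*popcount(a XOR b), which is the same identity as XNOR-then-popcount. A Python sketch of the arithmetic the firmware would run in C, one XOR and a hardware popcount per machine word:

```python
import random

def pack(v):
    """Pack a {-1, +1} vector into an int: +1 -> bit 1, -1 -> bit 0."""
    bits = 0
    for i, x in enumerate(v):
        if x == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors. Equal bits add +1,
    differing bits add -1, so dot = n - 2 * popcount(a XOR b).
    int.bit_count() needs Python 3.10+."""
    return n - 2 * (a_bits ^ b_bits).bit_count()

n = 64
a = [random.choice((-1, 1)) for _ in range(n)]
b = [random.choice((-1, 1)) for _ in range(n)]
assert binary_dot(pack(a), pack(b), n) == sum(x * y for x, y in zip(a, b))
```

Every multiply-accumulate in the network collapses to that one identity, which is why the whole thing fits in integer registers on a microcontroller.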

Meanwhile, someone spent a week running 7 local LLMs through real agent tasks on a Raspberry Pi 5 and found that most of them couldn't reliably locate a tool even when it was right there in the context ("Most couldn't even find the email tool," as the headline politely puts it). Qwen3.5-27B won by a significant margin. Agent capability remains the gap that matters most for production use, and most models are still losing badly at it under conditions that aren't a controlled demo.
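
The failure mode is easy to test yourself: put a handful of tool descriptions in the prompt, ask for a task that maps to exactly one of them, and check whether the model names it. A minimal harness sketch; the stub model and every name here are illustrative, not the Pi benchmark's code, and you'd swap in whatever local runtime you have:

```python
TOOLS = {
    "send_email": "Send an email with a subject and body to a recipient.",
    "read_calendar": "List calendar events for a given date.",
    "search_files": "Search local files for a keyword.",
}

def tool_prompt(task: str) -> str:
    listing = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    return (f"You have these tools:\n{listing}\n\n"
            f"Task: {task}\n"
            "Reply with the single tool name that accomplishes it.")

def tool_hit_rate(call_model, cases) -> float:
    """cases: (task, expected_tool_name) pairs; call_model(prompt) -> str.
    Fraction of tasks where the reply names the expected tool -- the
    'can it even find the email tool' number."""
    hits = sum(expected in call_model(tool_prompt(task)).lower()
               for task, expected in cases)
    return hits / len(cases)

# Stand-in model so the sketch runs; replace with your local model call.
def stub_model(prompt: str) -> str:
    return "send_email"

cases = [("Email Dana the Q3 summary.", "send_email"),
         ("What's on my schedule Friday?", "read_calendar")]
print(tool_hit_rate(stub_model, cases))  # 0.5
```

If a model can't clear this bar with three tools in a short context, the week-long agent benchmark results stop being surprising.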

The M5 Max pre-fill numbers, the SWE-rebench leaderboard update, the MiniMax weights coming in two weeks: fine, noted, moving on.

The Littlebird raise deserves a sentence: $11M for an app that reads your screen continuously and builds a queryable memory. Recall, but funded. Microsoft tried this and the backlash was immediate. The problem wasn't the technology. It was that nobody asked for a witness.

The real story in today's feed is that the people doing the most interesting work are auditing bad benchmarks and running models on microcontrollers. The press release crowd is somewhere else entirely, and the gap between those two worlds keeps widening.