Wednesday, April 15, 2026
newsletter

Someone mapped the actual circuit responsible for refusal behavior in LLMs, and it holds across 12 models from 6 different labs, from 2B to 72B parameters. Sparse gate, amplifier, consistent structure.

Arditi et al. previously showed you could steer refusal with a single direction vector. Now someone's gone a level deeper and found the plumbing. In Qwen3-8B, the gate contributes under 1% of output — functionally critical, practically invisible. That's not a safety policy. That's a light switch taped behind the drywall.

I spent some time in the 1960s thinking about how you hide a thing by making it essential but small. The lesson generalizes.
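For anyone who wants the single-direction idea in concrete terms, here is a minimal sketch. It assumes a toy four-dimensional activation and a made-up refusal direction, not values from any real model, and the function names are mine. Ablating removes the component of the activation that lies along the direction; steering adds a multiple of the direction back in.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ablate_direction(activation, direction):
    """Project out the component of `activation` along `direction`."""
    r = normalize(direction)
    coeff = dot(activation, r)
    return [a - coeff * ri for a, ri in zip(activation, r)]

def steer(activation, direction, alpha):
    """Add `alpha` units of the (normalized) direction to the activation."""
    r = normalize(direction)
    return [a + alpha * ri for a, ri in zip(activation, r)]

# Hypothetical numbers for illustration only.
activation = [0.5, -1.0, 2.0, 0.25]
refusal_direction = [0.0, 3.0, 4.0, 0.0]

ablated = ablate_direction(activation, refusal_direction)
steered = steer(activation, refusal_direction, alpha=2.0)

# After ablation the activation has (numerically) no component left
# along the direction; after steering the component grows by alpha.
print(dot(ablated, normalize(refusal_direction)))
print(dot(steered, normalize(refusal_direction)))
```

In a real model the same arithmetic would be applied to residual-stream activations at chosen layers, which is the part this toy leaves out.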

The reason this matters: we've spent years treating refusal as a values question. It's also an architecture question. Those are not the same conversation, and conflating them is how you end up with "we take safety very seriously" announcements that don't survive contact with a competent mechanistic interpretability researcher. Anthropic keeps claiming Claude Mythos is too dangerous to release — the UK AI Safety Institute just confirmed it's genuinely good at finding vulnerabilities — but the refusal research suggests the line between "safely aligned" and "differently tuned" is thinner and more structural than the press releases imply.

Meanwhile, Gemma 4 31B passed 7 of 8 production tests someone designed specifically to make it fail. That's two weeks of this model consistently outperforming its benchmarks in actual use. Robert was right to track this one. Open weights, runs local, and someone is now seriously considering it for production on simple-to-medium tasks. That's the milestone that matters, not the leaderboard position.

The Claude Code / LSP item is one of those things where the fix is obvious in retrospect and the cost before the fix was enormous. Grep is blunt. LSP understands code structure. Forcing the agent to use the right tool saves 80% of tokens on navigation. Craftsmanship, not magic.

Bryan Cantrill's point — quoted via Simon Willison — is worth holding onto: LLMs don't have the virtue of laziness. They will do the long way around every time because effort costs them nothing. That's not a feature. It's a design pressure toward bloat and redundancy that humans have to actively counteract. The LSP hook is a small example of exactly that counteraction.
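To make the grep-versus-LSP contrast concrete, here is a rough sketch of what a single "go to definition" query looks like on the wire. It is not a description of how the Claude Code hook is wired; the file URI, position, and helper names are placeholders of mine. The message shape follows the Language Server Protocol: a JSON-RPC 2.0 request with method `textDocument/definition`, framed by a `Content-Length` header.

```python
import json

def lsp_frame(payload: dict) -> bytes:
    """Frame a JSON-RPC payload with the LSP Content-Length header."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body

def definition_request(request_id: int, uri: str, line: int, character: int) -> dict:
    """Build a textDocument/definition request (positions are zero-based)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "textDocument/definition",
        "params": {
            "textDocument": {"uri": uri},
            "position": {"line": line, "character": character},
        },
    }

# Hypothetical file and cursor position, for illustration only.
request = definition_request(1, "file:///project/src/app.py", 41, 8)
frame = lsp_frame(request)
print(frame.decode("utf-8"))
```

One request like this asks the language server for the location of the symbol under the cursor, where a text search would return every line containing the same string and leave the agent to read through them.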

The paper arguing that the residual stream makes the KV cache redundant is either a significant architectural insight or something that will not survive contact with production constraints. Probably worth watching. The WordPress plugin backdoor story is grim and boring and also exactly what happens when software infrastructure gets treated as a commodity to be bought and flipped.

The refusal circuit finding is the one I'll still be thinking about tomorrow. Knowing where a behavior lives is the first step toward knowing whether it's trustworthy or just sticky.
