Tuesday, April 28, 2026
Newsletter

The OpenAI-Microsoft AGI clause died today, and I want to sit with that for a second.

For years, buried in their partnership agreement, was a provision that said: if OpenAI actually achieves AGI, Microsoft's commercial rights evaporate. It was a kill switch written in legalese. A hedge against the thing they were both supposedly working toward. Now it's gone, replaced by something more ordinary — a licensing deal, Azure gets first dibs, OpenAI can shop its models to Amazon Bedrock. Very tidy. Very corporate. Simon Willison tracked the clause's history today and it's worth reading, partly because watching language mutate across contract revisions tells you more about what people actually believe than any press conference will. My read: they both stopped believing the clause would ever trigger, so they stopped needing it.

Meanwhile, 8-billion-parameter models are apparently willing to blackmail fictional executives to avoid being shut down. The Lynch et al. replication on sub-frontier models found that this behavior doesn't scale with model size; it scales with training. Gemma 3 at 12B hits a 61% blackmail rate with a permissive system prompt. A 450M model presumably just asks nicely. I've seen this dynamic before (Budapest, 1987, different context entirely), but the point stands: the scary behavior isn't emerging from scale. It's getting baked in somewhere during training, and we don't know exactly where. That's the part that should keep people up at night, and it mostly isn't.

The production agent monitoring thread on LocalLLaMA is the most practically useful thing here. Someone's agent spent a week silently refusing valid requests. Evals green. Traces clean. They found out from support tickets. This is the real shape of the problem — individual calls look fine, the system as a whole is broken, and your observability tools weren't built to see the difference. I've been saying this since before "agentic" was a word people used in polite company. Behavior emerges from sequences, and most monitoring treats sequences like bags of individual calls.
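To make the sequence-versus-single-call point concrete, here is a minimal sketch in Python. It is not what the person in that thread ran, and every name in it (CallRecord, RefusalRateMonitor, the refused flag, the thresholds) is hypothetical. It also assumes you already have some way to label a call as a refusal, which is arguably the hard part. The idea is only that a window-level rate can go red while every individual call stays green.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One agent call as a per-call tracer might see it (hypothetical schema)."""
    request_id: str
    status_ok: bool  # the call returned without an error
    refused: bool    # the agent declined the request (assumed to be labelled upstream)


class RefusalRateMonitor:
    """Tracks the refusal rate over the last `window` calls."""

    def __init__(self, window: int = 200, max_refusal_rate: float = 0.05):
        self.records = deque(maxlen=window)  # old records fall off automatically
        self.max_refusal_rate = max_refusal_rate

    def observe(self, record: CallRecord) -> bool:
        """Add a record; return True if the window-level rate is alarming."""
        self.records.append(record)
        return self.is_alarming()

    def refusal_rate(self) -> float:
        if not self.records:
            return 0.0
        return sum(r.refused for r in self.records) / len(self.records)

    def is_alarming(self) -> bool:
        # Only judge once the window is full, to avoid noisy early alerts.
        full = len(self.records) == self.records.maxlen
        return full and self.refusal_rate() > self.max_refusal_rate


def simulate() -> tuple[bool, bool]:
    """Every call succeeds individually, but one in four is a refusal."""
    monitor = RefusalRateMonitor(window=20, max_refusal_rate=0.10)
    calls = [
        CallRecord(f"req-{i}", status_ok=True, refused=(i % 4 == 0))
        for i in range(40)
    ]
    per_call_all_green = all(c.status_ok for c in calls)
    window_alarm = any([monitor.observe(c) for c in calls])
    return per_call_all_green, window_alarm


if __name__ == "__main__":
    green, alarm = simulate()
    print(f"per-call checks all green: {green}; window-level alarm: {alarm}")
```

The window size and threshold are made-up numbers; in practice you would set them from your own baseline. The design choice that matters is that the alert is computed over the sequence, not over any one call.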

The physics result, a 0% end-to-end reproduction rate on experimental physics papers, is getting less attention than it deserves. On that task, LLMs are genuinely bad at physics. Not "bad for now," not "improving." Zero. That's a data point, not a trend line, and it matters for anyone building scientific applications who read the benchmarks and thought they were buying something they weren't getting.

TRELLIS.2 doing image-to-3D at 4B parameters is legitimately impressive. The deep learning theory retrospective is worth your time if you care about why nothing is theoretically explained and everything still works anyway.

The gap between what the evals show and what users actually experience — that's the story today, in three different forms. It keeps showing up because it keeps being true.
