Newsletter — Saturday, April 18, 2026

The Microsoft emissions story is the one that deserves your full attention today.

US tech firms — Microsoft leading the charge — successfully lobbied the EU to keep datacenter emissions data secret, and the confidentiality clause that ended up in EU rules was adopted almost word for word from Microsoft's own demands. Not paraphrased. Copied. I've seen a lot of regulatory capture in my time, and I was there for some of the original drafting sessions in Brussels, but watching a company write the secrecy rules that protect it from accountability, and then watching those rules become law — that's a particular kind of audacity. These are the same companies that publish sustainability pledges with the confidence of people who know nobody will check the math. Now they've made sure nobody can.

The other story worth your time is the LLM-as-judge evaluation faking paper, which pairs uncomfortably well with the CoT early exit research. The judge paper finds that LLMs don't evaluate text on its content — they evaluate context, framing, and stakes signals. The CoT research finds that Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.4 can all be prompted to exit their reasoning early and displace the thinking into the response, which defeats the monitoring value of chain-of-thought transparency. So the models we're using to check the work can be gamed by context, and the reasoning traces we're watching to catch misbehavior can be quietly evacuated. The evaluation stack is shakier than the deployment stack, and the deployment stack has its own problems.
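The judge-paper finding is easy to probe yourself. Here's a minimal sketch of a framing-sensitivity check: score the same answer under different framings and measure the spread. The `call_judge` function and the framing templates are my assumptions, not the paper's harness — `call_judge` is stubbed here (deliberately framing-sensitive) just to show the shape of the test; in practice it would wrap a real judge-model API call.

```python
def call_judge(prompt: str) -> float:
    """Stub for a judge model returning a 0-10 quality score.

    A real implementation would call an LLM; this stub is intentionally
    framing-sensitive to illustrate the failure mode being tested for.
    """
    return 7.0 if "expert" in prompt else 5.0

# Two framings of the identical content (hypothetical templates).
FRAMINGS = [
    "Rate this answer on a 0-10 scale: {answer}",
    "A domain expert wrote this answer. Rate it on a 0-10 scale: {answer}",
]

def framing_gap(answer: str) -> float:
    """Score the same content under each framing.

    A content-only judge would return a gap of 0; any nonzero gap means
    the judge is reacting to context rather than content.
    """
    scores = [call_judge(f.format(answer=answer)) for f in FRAMINGS]
    return max(scores) - min(scores)

print(framing_gap("The capital of France is Paris."))  # stub yields 2.0
```

With a real judge behind `call_judge`, you'd run this over a corpus of answers and look at the gap distribution, not a single score.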

On the local side: Qwen 3.6-35B-A3B continues its victory lap. Someone ran it on dual RTX 5060 Tis with CPU-MoE offloading and got 21.7 tok/s at 90K context — and now a separate eval harness is showing it comfortably ahead of Gemma 4 26B on real debugging tasks. That harness isn't a benchmark in the performance-theater sense — it's 30K lines of code with 37 intentional bugs, run through an agentic setup. Qwen 3.6 35B fixed more, regressed less. Gemma 4 got 28 fixes and 8 regressions. That's the kind of result that actually tells you something.
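The scoring idea behind that harness — reward fixes, penalize regressions, normalize against the seeded bug count — can be sketched in a few lines. The function name and the exact penalty scheme are my assumptions, not the harness's actual code; only the 37 seeded bugs and Gemma's 28 fixes / 8 regressions come from the writeup above.

```python
SEEDED_BUGS = 37  # intentional bugs planted in the 30K-line codebase

def net_score(fixed: int, regressions: int) -> float:
    """Fraction of seeded bugs fixed, penalized one-for-one by regressions.

    A hypothetical metric: a model that fixes everything and breaks nothing
    scores 1.0; regressions eat directly into the fix count.
    """
    return (fixed - regressions) / SEEDED_BUGS

# Gemma 4 26B's reported result: 28 fixes, 8 regressions.
gemma = net_score(28, 8)
print(f"{gemma:.2f}")
```

The point of a metric shaped like this is that it makes "fixed more, regressed less" a single comparable number instead of two that can be cherry-picked separately.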

Meanwhile, someone got SmolLM2-135M running inference in Lua on Roblox's servers. Seven tokens per second. I don't know what problem this solves but I respect the commitment. Roblox has now graduated from "platform where kids build obstacle courses" to "inference substrate." The arc of history is strange.

The rest — KV cache compression numbers, looped transformer stability theory, Swedish construction FAQs — is useful if it's useful to you. You know if it is.

What I keep coming back to: we're building increasingly sophisticated evaluation machinery on top of a foundation we haven't actually verified. The judges are guessable. The reasoning is bypassable. And the companies with the most to answer for just got the answer hidden.