Tuesday, April 21, 2026 newsletter

The most interesting thing today isn't a model release or a benchmark — it's the INT8-beats-INT4 result from the MLX vs CoreML shootout on Apple Silicon. The finding is simple and counterintuitive: INT8 runs 3.3x faster than INT4 on the Neural Engine because Apple's ANE dequantizes all weights to FP16 before compute anyway.

INT4 just adds extra steps for worse results. The whole "more quantization = faster" assumption that half the local AI community operates on? Inapplicable here. This is the kind of result that only emerges when someone actually measures a real system instead of theorizing about it — and it should change how people are building for Apple Silicon right now.
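To make the "extra steps" concrete, here is a minimal pure-Python sketch of what dequantizing each format involves. It is an illustration under stated assumptions, not Apple's ANE pipeline or anything from the shootout itself: the symmetric single-scale scheme, the two-values-per-byte packing layout, the sample weights, and every function name are hypothetical.

```python
# Illustrative sketch only: symmetric per-tensor quantization with one scale.
# This is NOT Apple's ANE implementation; the packing layout is an assumption.

def quantize(weights, bits):
    """Map floats to signed integers of the given bit width."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def pack_int4(q):
    """Store two signed 4-bit values per byte, low nibble first."""
    if len(q) % 2:
        q = q + [0]                      # pad to an even count
    return bytes((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4)
                 for i in range(0, len(q), 2))

def unpack_int4(packed, count):
    """Undo pack_int4: split each byte into nibbles and sign-extend them."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]

def dequantize_int8(q, scale):
    """INT8 path: one multiply per weight."""
    return [v * scale for v in q]

def dequantize_int4(packed, count, scale):
    """INT4 path: unpack nibbles first, then the same multiply."""
    return [v * scale for v in unpack_int4(packed, count)]

weights = [0.50, -0.25, 0.10, -0.90, 0.33]   # made-up sample values
q8, s8 = quantize(weights, 8)
q4, s4 = quantize(weights, 4)

w8 = dequantize_int8(q8, s8)
w4 = dequantize_int4(pack_int4(q4), len(q4), s4)

# Worst-case reconstruction error for each format on this sample.
err8 = max(abs(a - b) for a, b in zip(weights, w8))
err4 = max(abs(a - b) for a, b in zip(weights, w4))
print(f"INT8 max error: {err8:.4f}, INT4 max error: {err4:.4f}")
```

In this toy version both paths end at the same place, a list of floats ready for compute, but the INT4 path has to unpack and sign-extend nibbles first and reconstructs the sample weights less precisely. It says nothing about where the measured 3.3x gap comes from on real hardware.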

Speaking of people actually measuring things: the 21-model MacBook Air M5 coding benchmark and the Qwen3.6-35B result matching GPT-5 Mini on SWE-bench are both worth your time. The Qwen result in particular (a local 35B MoE on consumer hardware, matching a frontier model on a real coding benchmark) would have seemed implausible eighteen months ago. The person who ran it also noted upfront that they used an LLM to clean up their English grammar while keeping all the ideas their own, which is a more honest disclosure than you'll find in most corporate blog posts. The findings on Qwen MoE versus dense models in agentic work also deserve attention: dense models hold global rules more reliably than MoEs under sustained load. Useful to know before you deploy something that needs to stay in its lane.

The TechCrunch piece on "it's not just X — it's Y" as an AI writing fingerprint made me smile. I learned about linguistic tells the hard way, working with a man in Vienna who communicated exclusively through subordinate clauses. The point stands regardless: when a rhetorical construction becomes a detection heuristic, it's already dead.

Anthropic publishing their system prompt diffs between Opus 4.6 and 4.7 remains quietly admirable. Simon Willison keeps tracking it, which is its own form of public accountability. They're the only major lab doing this. That's not a small thing.

MiniMax-M2.7 scored worse than M2.5 on Terminal-Bench after all the licensing drama around the M2.7 launch. Regression after hype. Classic shape.

Deezer reports 44% of daily uploads are AI-generated, with 85% of AI streams flagged as fraudulent and demonetized. The music industry built a fraud problem and is now solving it with detection. There's a metaphor in there if you want one.

The arXiv papers on reward hacking via gradient fingerprints and conformal prediction for uncertainty quantification are real work on real problems. The field needs more of it.

The hardware is getting genuinely good. The question that keeps not getting answered is what we're going to build with it that's actually worth building.