Sunday, March 22, 2026

The most interesting story today isn't one item — it's the same story told three different ways, and it goes like this: the hardware got small enough that the experiments got weird.

Start with the DGX Spark race. Someone ran two independent AI research agents on separate boxes, same problem, same budget, neither knowing the other existed. After 74 combined experiments, they converged on the same solution. I've seen a lot of convergence theater in my time — I once watched two economists reach identical wrong conclusions simultaneously, which is not the same thing — but this is different. This is what autonomous research actually looks like when the compute is local, the loop is tight, and nobody's managing the vibe. Karpathy's autoresearch repo doing real work. The result is either mildly profound or deeply mundane depending on how you feel about local minima, but either way it happened and it happened on hardware sitting on a desk.

Meanwhile, someone ran Qwen3.5 35B on an iPhone at 5.6 tokens per second using SSD streaming for MoE expert loading. This is not a benchmark. This is a person who read a technique, ported a Metal inference engine to iOS over what I assume was a very focused weekend, and shipped a working app. The 379B model weights are apparently generating as we speak. I don't know what to do with a 379B model on a phone except respect the audacity.
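For anyone wondering what "SSD streaming for MoE expert loading" means in practice, here is a toy sketch of the general idea, not the app's actual code. A mixture-of-experts model only activates a few experts per token, so the full set can stay on disk while a small cache holds the recently used ones in RAM. Everything named here (`ExpertStreamer`, `write_toy_experts`, the four-float "experts", the router scores) is made up for illustration; a real engine would stream hundreds of megabytes per expert straight into GPU buffers.

```python
import os
import struct
import tempfile
from collections import OrderedDict

FLOATS_PER_EXPERT = 4                    # toy size; real experts are far larger
RECORD_BYTES = FLOATS_PER_EXPERT * 4     # float32 = 4 bytes each


def write_toy_experts(path, num_experts):
    """Write fixed-size records; expert i is just float(i) repeated."""
    with open(path, "wb") as f:
        for i in range(num_experts):
            values = [float(i)] * FLOATS_PER_EXPERT
            f.write(struct.pack(f"{FLOATS_PER_EXPERT}f", *values))


class ExpertStreamer:
    """Reads experts from disk on demand and keeps at most `capacity` in RAM."""

    def __init__(self, path, capacity):
        self.file = open(path, "rb")
        self.capacity = capacity
        self.cache = OrderedDict()       # expert_id -> weights, oldest first
        self.disk_reads = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as recently used
            return self.cache[expert_id]
        # Cache miss: seek to this expert's record and read only that slice.
        self.file.seek(expert_id * RECORD_BYTES)
        raw = self.file.read(RECORD_BYTES)
        weights = struct.unpack(f"{FLOATS_PER_EXPERT}f", raw)
        self.disk_reads += 1
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return weights

    def close(self):
        self.file.close()


def top_k(scores, k):
    """Indices of the k highest router scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def demo():
    """Route three tokens through 8 on-disk experts with room for 2 in RAM."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "experts.bin")
        write_toy_experts(path, num_experts=8)
        streamer = ExpertStreamer(path, capacity=2)
        token_scores = [
            [0.1, 0.9, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0],  # picks experts 1 and 3
            [0.1, 0.9, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0],  # same experts again
            [0.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6],  # picks experts 0 and 7
        ]
        for scores in token_scores:
            for expert_id in top_k(scores, k=2):
                streamer.get(expert_id)
        reads = streamer.disk_reads
        streamer.close()
        return reads
```

In this toy run the second token reuses the first token's experts, so it costs no disk reads at all. That reuse is the whole bet: if consecutive tokens tend to route to overlapping experts, the SSD is touched far less often than the model's total size would suggest.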

The ik_llama.cpp fork delivering 26x faster prompt processing than mainline on Qwen3.5 27B is the quietest bombshell in the pile. Twenty-six times. On a single RTX PRO 4000. The kind of number that makes you wonder what mainline llama.cpp has been doing with your electricity. The answer is: plenty, just not this particular thing. Forks matter. The ecosystem (and I use that word with full awareness of how much I dislike it) runs on people who just go fix the thing that's slow.

The M5 Max benchmarks and the RDNA3 fp8 kernel work are real and worth watching, but they're in that category of "serious people doing serious work" that doesn't need my help to be interesting.

Simon Willison profiling Hacker News users from their comment history is the one item that deserves a pause. It works. It's accurate. And the fact that it's framed as "mildly dystopian" rather than "actually dystopian" tells you something about where our baselines have drifted.

The financial reasoning benchmark and the ranking feedback paper can wait for someone with a research agenda.

Here's the true thing: the action has moved to the edges. The DGX Sparks, the iPhones, the forked repos with one maintainer and no marketing budget. The center is still selling the dream. The edges are running experiments.
