It's the guy who took Apple's locked-down on-device 3B model and doubled its performance on shell commands without touching a single weight.
Dynamic few-shot retrieval — pull the right examples at inference time, shove them in context — moved the needle from 40% to 70%+ on a real task. This is the kind of result that matters: no fine-tuning budget required, no new hardware, just someone who understood what the model actually needed and gave it that. I've seen this pattern work going back further than I care to admit — the right context beats a bigger model more often than the benchmarks will tell you.
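The mechanics are simple enough to fit in a page. A minimal sketch of the idea, not the actual implementation from the story: score a pool of labeled (instruction, command) examples against the incoming query, take the top-k, and prepend them to the prompt. The pool, the queries, and the bag-of-words scorer here are all stand-ins; a real system would typically use embedding similarity.

```python
# Sketch of dynamic few-shot retrieval (illustrative; not the code from
# the story). At inference time we score a small pool of labeled examples
# against the user query and prepend the top-k matches to the prompt.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity over bag-of-words token counts."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(query: str, pool: list[tuple[str, str]], k: int = 2):
    """Return the k (instruction, command) pairs closest to the query."""
    q = Counter(query.lower().split())
    return sorted(pool,
                  key=lambda ex: cosine(q, Counter(ex[0].lower().split())),
                  reverse=True)[:k]

def build_prompt(query: str, pool, k: int = 2) -> str:
    """Assemble retrieved shots plus the new query into one prompt."""
    shots = retrieve_examples(query, pool, k)
    lines = [f"Instruction: {i}\nCommand: {c}" for i, c in shots]
    lines.append(f"Instruction: {query}\nCommand:")
    return "\n\n".join(lines)

# Toy example pool (invented for illustration).
pool = [
    ("list all files including hidden ones", "ls -la"),
    ("find text in files recursively", "grep -r 'text' ."),
    ("show disk usage of current directory", "du -sh ."),
]
print(build_prompt("list hidden files in a directory", pool, k=2))
```

The whole trick is that the retrieval step runs per query, so the model sees examples that look like the task in front of it rather than a static prompt chosen months ago.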
Speaking of benchmarks that actually mean something: someone rebuilt the medical speech-to-text leaderboard with a metric that weights "amoxicillin" differently than "yeah." The leaderboard reshuffled entirely. Of course it did. This is what happens when you measure the thing that matters instead of the thing that's easy to measure. Medical WER is the kind of domain-specific craft that gets zero press coverage and saves lives. File it under things that matter more than they sound.
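To make the idea concrete, here is one way a term-weighted WER can be sketched. Everything specific is an assumption: the critical vocabulary, the 5x weight, and the normalization are invented for illustration, and the actual leaderboard metric may be built quite differently.

```python
# Illustrative term-weighted WER: errors on words in a critical
# vocabulary (drug names, here) cost more than errors on filler.
# Weights and vocabulary are invented; the real metric may differ.
CRITICAL = {"amoxicillin", "warfarin", "metformin"}

def weight(word: str) -> float:
    return 5.0 if word.lower() in CRITICAL else 1.0

def weighted_wer(ref: list[str], hyp: list[str]) -> float:
    """Weighted word-level edit distance, normalized by total reference weight."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = min weighted cost to align ref[:i] with hyp[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + weight(ref[i - 1])   # deletion
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + 1.0                  # insertion
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if ref[i - 1] == hyp[j - 1] else weight(ref[i - 1])
            dp[i][j] = min(dp[i - 1][j] + weight(ref[i - 1]),  # delete
                           dp[i][j - 1] + 1.0,                 # insert
                           dp[i - 1][j - 1] + sub)             # substitute/match
    total = sum(weight(w) for w in ref) or 1.0
    return dp[n][m] / total

ref = "patient takes amoxicillin twice daily".split()
print(weighted_wer(ref, "patient takes amoxicillin twice daily".split()))  # 0.0
# Dropping the drug name vs. dropping an ordinary word:
print(weighted_wer(ref, "patient takes twice daily".split()))       # misses the drug
print(weighted_wer(ref, "patient takes amoxicillin daily".split())) # misses "twice"
```

Under this weighting, losing "amoxicillin" is five times the penalty of losing "twice," which is exactly the kind of asymmetry plain WER throws away.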
On the infrastructure side, tensor parallelism landed in llama.cpp — backend-agnostic, meaning your non-CUDA GPUs are invited to the party now. It's experimental and your mileage will vary, but the direction is right. Multi-GPU local inference keeps getting more accessible, and the Frankenstein build community (today's entry: 120GB VRAM across four GPUs hanging off a MiniPC, named AIfred) continues to be the most honest R&D lab in the industry.
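For readers who haven't internalized what tensor parallelism actually splits: here's a conceptual sketch of the column-parallel case, with plain lists standing in for devices. This illustrates the idea only; llama.cpp's actual implementation is far more involved.

```python
# Conceptual sketch of column-parallel tensor parallelism: the weight
# matrix is split column-wise across "devices" (plain lists here), each
# device computes its shard of the output, and the shards are gathered.
def matmul(x, W):
    """x: length-n vector; W: n x m matrix as a list of rows."""
    n, m = len(W), len(W[0])
    return [sum(x[i] * W[i][j] for i in range(n)) for j in range(m)]

def split_columns(W, parts):
    """Slice an n x m weight matrix into `parts` column shards."""
    m = len(W[0])
    step = m // parts
    return [[row[p * step:(p + 1) * step] for row in W] for p in range(parts)]

x = [1.0, 2.0, 3.0]
W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]

shards = split_columns(W, 2)             # two "GPUs", two columns each
partials = [matmul(x, Ws) for Ws in shards]
out = [v for p in partials for v in p]   # all-gather: concatenate shards
assert out == matmul(x, W)               # identical to single-device result
print(out)                               # → [38.0, 44.0, 50.0, 56.0]
```

Each device only ever holds its slice of the weights, which is the whole point: a model that doesn't fit on one GPU fits across four.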
Intel Arc Pro B70 got a real-world write-up, and the verdict is exactly what you'd expect from Intel's driver culture: impressive numbers in one configuration, a software nightmare everywhere else. 235 t/s on Gemma 3 27B is genuinely fast. The rest of the experience is apparently a full-time job. Hardware without software is a monument to good intentions.
The OpenWork silent relicensing story is a quiet reminder that "open source" printed on a README is a claim, not a contract. Read the license. Then read it again six months later.
The arXiv cluster — satellite constellation routing, molecular dynamics, robot motion capture suits — is real work that will matter eventually. Today it's noise.
And someone documented why mixed KV cache quantization is a bad idea, which is useful to know and will be ignored by approximately the same number of people who were already doing it wrong.
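If you want to check that kind of claim yourself rather than take anyone's word for it, the harness is small. Everything here is a toy assumption — tiny dimensions, random data, per-tensor symmetric quantization — and it shows how one would measure mixed K/V precision, not what the linked write-up found.

```python
# Toy harness for measuring how quantizing K and V at different bit
# widths perturbs a single attention output. Dimensions, data, and the
# quantization scheme are all illustrative assumptions.
import math, random

random.seed(0)
D, T = 8, 16  # head dim, sequence length

def quantize(x: list[float], bits: int) -> list[float]:
    """Symmetric per-tensor quantize/dequantize to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax or 1.0
    return [round(v / scale) * scale for v in x]

def attention(q, K, V):
    """Single-query softmax attention over T key/value rows."""
    logits = [sum(q[d] * K[t][d] for d in range(D)) / math.sqrt(D)
              for t in range(T)]
    mx = max(logits)
    w = [math.exp(v - mx) for v in logits]
    s = sum(w)
    w = [v / s for v in w]
    return [sum(w[t] * V[t][d] for t in range(T)) for d in range(D)]

q = [random.gauss(0, 1) for _ in range(D)]
K = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
V = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]

ref = attention(q, K, V)
for kb, vb in [(8, 8), (4, 8), (8, 4), (4, 4)]:
    Kq = [quantize(row, kb) for row in K]
    Vq = [quantize(row, vb) for row in V]
    err = max(abs(a - b) for a, b in zip(ref, attention(q, Kq, Vq)))
    print(f"K{kb}/V{vb}: max abs error {err:.4f}")
```

Note that errors in K pass through the softmax before touching the output while errors in V hit it linearly, so there's no reason to expect the two to degrade symmetrically — which is why "just mix the precisions" deserves measurement before deployment.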
The through-line today is the same as it always is: the people actually running these systems in production, on real hardware, with real constraints, are generating more signal than the people writing the announcements. They always have been.