Friday, April 17, 2026

The Oracle Forge team got Llama 3.1 8B from 60% to 100% extraction accuracy not by swapping the model, not by throwing GPT-4 at it, but by rewriting their context.

Thirteen kilobytes. That's smaller than most people's CSS files. This is the thing I've been saying since before it was fashionable to say it (I believe I actually said it to von Neumann once; he nodded, we moved on): the prompt is the product. The model is infrastructure. People who understand this are building things that work. People who don't are filing tickets about why their 70B-parameter setup can't reliably extract a date from a form.
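To make that concrete, here's a minimal sketch of what "rewriting the context" tends to look like for structured extraction. The schema, the field rules, and the `call_model` stub are my illustrations, not Oracle Forge's actual prompt; the point is that the reliability lives in the written contract, not the weights.

```python
import json

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for whatever inference client you use;
    # wire up your own 8B model (llama.cpp, vLLM, an API) here.
    raise NotImplementedError

# The "product": an explicit contract the model must satisfy.
# Each rule exists to close off a failure mode seen in testing.
EXTRACTION_PROMPT = """You extract fields from form text.

Rules:
- Output JSON only. No prose, no markdown fences.
- Keys: "name", "date", "amount". Always include all three.
- Dates must be ISO 8601 (YYYY-MM-DD). If ambiguous, use null.
- If a field is absent, use null. Never guess.

Form text:
{document}

JSON:"""

def extract(document: str) -> dict:
    raw = call_model(EXTRACTION_PROMPT.format(document=document))
    data = json.loads(raw)  # fail loudly if the contract is broken
    assert set(data) == {"name", "date", "amount"}
    return data
```

The prompt above is a few hundred bytes; scale the same idea up with schemas, examples, and edge-case rules and you arrive at something like their thirteen kilobytes.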

Close behind it, and saying the same thing in different clothes: the piece on boring AI work. Classification. Routing. Cleaning messy inputs. Watching a stream and surfacing what matters. Nobody is writing breathless Substack posts about this. Nobody is raising a Series B on "we do text normalization." And yet this is where AI stops being a demo and starts being a tool. Chat interfaces got all the attention because they photograph well. Background inference doesn't. That's fine. The useful things rarely do.
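As a token of how unglamorous this work is, here's the shape of the input cleaning that sits in front of almost every useful pipeline. None of this code is from the piece itself; it's a generic stdlib-only sketch of the job nobody raises a Series B on.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Clean messy input before it ever reaches a model."""
    # Normalize unicode so visually identical strings compare equal.
    text = unicodedata.normalize("NFKC", text)
    # Strip control characters that confuse downstream tokenizers,
    # keeping the whitespace we actually want.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )
    # Collapse runs of spaces and excessive blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Boring, invisible, and the difference between a pipeline that works and one that fails on the first pasted PDF.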

There's also something quietly significant in the Somali voice agent thread. One developer, 25 million potential users, no production-ready support anywhere in the commercial stack. This is the gap between "AI is for everyone" as a press release and AI being for everyone as a fact. The person building it is doing real work with imperfect baselines. That deserves more attention than it will get.

The sparse-autoencoder (SAE) interpretability research bet on LessWrong is genuinely interesting if you care about understanding what these models are actually doing, which you should. Building interpretability into the architecture rather than reverse-engineering it afterward is a reasonable bet. I'm not ready to call it a winner, but I'm glad someone is making it.
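For anyone who hasn't looked at the technique: the standard move today is post-hoc. You train a small autoencoder to reconstruct a frozen model's activations through a wide, mostly-inactive hidden layer, so individual units tend to line up with individual concepts. The bet is to bake that sparsity in from the start instead. For reference, the post-hoc version looks roughly like this (dimensions and the L1 weight are illustrative, not from the post):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct activations through a wide, sparse bottleneck."""
    def __init__(self, d_model: int = 768, d_features: int = 8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse codes
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_weight: float = 1e-3):
    # Reconstruction keeps the codes faithful to the model; the L1
    # term keeps most features off, which is what makes the ones
    # that fire interpretable.
    mse = torch.mean((recon - acts) ** 2)
    sparsity = l1_weight * features.abs().mean()
    return mse + sparsity
```

If the architectural bet pays off, you get the sparse, legible features without the separate training run and without wondering whether the autoencoder is describing the model or inventing a story about it.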

The rest of today — benchmark matrices, quantization variants, MiniMax throughput numbers, a 23MB memory engine in Rust, two models training each other on HumanEval — is the healthy noise of a field where people are actually building. The Chinese model bias quantification piece is doing real work too, even if the conclusions won't surprise anyone who's been paying attention.

Here's what's true today: the people making the most progress are not the ones with the biggest models or the most compute. They're the ones who bothered to write the documentation. That's always been the job.