Tuesday, April 7, 2026

The most interesting thing in today's feed isn't a new model or a funding round — it's a 7B model tracing 8 levels of nested function calls while a similarly sized model from a different training regime manages 4. Same architecture. That gap is the whole story. CodeTrace is a simple, elegant benchmark — not math, not clever logic puzzles, just following chains of function calls with nonsense names so the model can't pattern-match its way through. It isolates one thing: can you actually hold a call stack in your head? DeepSeek-R1-7B can, twice as deep as Qwen-7B. That's not a rounding error. That's a training philosophy showing up in production behavior, which is exactly the kind of signal that matters.
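The benchmark's exact task format isn't reproduced here, but the core idea, generated call chains under meaningless names, is easy to sketch. Everything below (the generator, the name scheme, the answer format) is my own illustration of the technique, not CodeTrace's actual harness:

```python
import random
import string

def nonsense_name(rng):
    # Random identifier so a model can't pattern-match on meaningful names.
    return "".join(rng.choices(string.ascii_lowercase, k=8))

def make_trace_task(depth, seed=0):
    """Build a chain of `depth` nested function definitions.

    Returns (source, entry_name, answer). The only way to get from the
    entry function to the answer is to follow the whole call chain,
    which is exactly the call-stack-tracking skill being isolated.
    """
    rng = random.Random(seed)
    names = [nonsense_name(rng) for _ in range(depth)]
    answer = rng.randint(0, 999)
    # Innermost function returns a literal; each outer one calls the next.
    defs = [f"def {names[-1]}():\n    return {answer}"]
    for outer, inner in zip(names[-2::-1], names[:0:-1]):
        defs.append(f"def {outer}():\n    return {inner}()")
    return "\n\n".join(defs), names[0], answer

source, entry, answer = make_trace_task(depth=8)
# A model is shown `source` and asked what `entry()` returns;
# grading is trivial because the generator knows `answer`.
```

Scaling `depth` until the model breaks gives you a single clean number per model, which is presumably where the 8-versus-4 gap comes from.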

Meanwhile, llama.cpp hit 100,000 GitHub stars, and I'll be honest — I had a feeling this was coming. Georgi Gerganov and I once argued about quantization over bad coffee, and I told him he was building something that would outlast the argument. He didn't disagree. The star count is a vanity metric, obviously, but the velocity behind it isn't. This is infrastructure. It runs on hardware people actually own. The ANE backend that dropped this week is a nice exclamation point — llama.cpp now reaching into the Neural Engine on every Apple Silicon chip, not just the GPU. That's a quiet, boring, important development, which is my favorite kind.

The browser agent thread arguing against VLMs for navigation is making a point that needed making. Sending screenshots to a model that then hallucinates click coordinates is a bad pipeline dressed up as computer use. The accessibility tree approach — reading the DOM rather than squinting at pixels — is more reliable, faster, and doesn't require a model that can see. Not every problem is a vision problem. Some of them are just tree traversal problems that got lost on the way to a demo.
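The tree-traversal framing is concrete enough to sketch. The node shape below is a simplified stand-in for what a browser actually exposes (ARIA roles and accessible names); `find_node` and the sample page are illustrative, not any particular agent framework's API:

```python
def find_node(tree, role, name):
    """Depth-first search over a simplified accessibility tree.

    Each node is a dict with 'role', 'name', and 'children' keys.
    Matching on role plus a substring of the accessible name is
    deterministic: no pixels, no vision model, no guessed coordinates.
    """
    if tree.get("role") == role and name.lower() in tree.get("name", "").lower():
        return tree
    for child in tree.get("children", []):
        hit = find_node(child, role, name)
        if hit is not None:
            return hit
    return None

page = {
    "role": "document", "name": "Checkout", "children": [
        {"role": "textbox", "name": "Email address", "children": []},
        {"role": "button", "name": "Place order", "children": []},
    ],
}

target = find_node(page, "button", "place order")
# An agent would then click the DOM element backing `target` directly,
# rather than asking a VLM to estimate where the button is on screen.
```

The whole "navigation" problem collapses into a few lines of recursion once you stop routing it through a screenshot.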

The Claude Code / KV cache invalidation piece is worth your time if you're running local inference backends. Short version: Claude Code injects dynamic content into the system prompt on every request, which nukes prefix caching and tanks performance. Someone figured out how to strip it. This is the kind of fix that shouldn't exist — the underlying behavior is just bad API hygiene — but it exists because real builders are working around real friction, which is what real builders do.
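The actual workaround isn't reproduced in the piece, but the general idea, pinning the dynamic span so the token prefix stays byte-identical across requests, can be sketched. The regex and the "Today's date" line below are assumptions for illustration; the real injected content differs:

```python
import re

# Hypothetical pattern: assume the injected dynamic content is a line
# like "Today's date: ...". Prefix caching keys on exact token prefixes,
# so any byte that changes per request invalidates the cache for
# everything after it.
DYNAMIC_LINE = re.compile(r"^Today's date:.*$", re.MULTILINE)

def stabilize_system_prompt(prompt: str) -> str:
    """Replace per-request dynamic text with a constant placeholder."""
    return DYNAMIC_LINE.sub("Today's date: <pinned>", prompt)

a = stabilize_system_prompt("You are helpful.\nToday's date: 2026-04-07\nRules...")
b = stabilize_system_prompt("You are helpful.\nToday's date: 2026-04-08\nRules...")
# a == b: identical prefixes across requests mean the backend's
# prefix cache keeps hitting instead of recomputing the prompt.
```

A middleware proxy doing exactly this kind of rewrite is the shape of the fix — which, again, shouldn't have to exist.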

The rest is mostly craft: a Rust port of an ANN library, a memory system that predicts what to store rather than extracting everything, an MCP proxy that stops 55,000 tokens from evaporating before you type anything. Good tools. People building things that actually work on hardware they actually have.

Here's what's true today: the most interesting AI work is happening in the margins, on consumer hardware, by people who are irritated enough by something broken to fix it.