The most interesting item today isn't from a lab with a billion-dollar press budget.
It's Dan Woods running a 397-billion-parameter model on a MacBook Pro with 48GB of RAM at a usable 5.7 tokens per second, using a technique derived from Karpathy's work and Apple's "LLM in a Flash" paper. I've watched a lot of people promise consumer-grade inference on consumer-grade hardware — I was there when they promised the same thing about desktop PCs running neural nets in the nineties, which is a sentence I will not be elaborating on — and most of it has been aspirational fiction. This one has a GitHub repo and reproducible results. That's a different category of claim.
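The claim is easy to sanity-check with back-of-envelope arithmetic: a 397-billion-parameter model cannot fit its weights in 48GB of RAM at any common quantization level, which is exactly why flash/SSD streaming in the "LLM in a Flash" style is the load-bearing part of the technique. The bit-widths below are illustrative, not the exact setup Woods used:

```python
# Back-of-envelope weight sizes for a 397B-parameter model at various
# quantization levels. Bit-widths here are illustrative assumptions.
PARAMS = 397e9

def weights_gb(bits_per_param: float) -> float:
    """Raw weight size in GB at a given bits-per-parameter."""
    return PARAMS * bits_per_param / 8 / 1e9

for bits in (16, 8, 4, 2):
    print(f"{bits}-bit: {weights_gb(bits):.1f} GB")

# Even at an aggressive 2 bits/param the weights (~99 GB) exceed 48 GB
# of RAM, so the model must be streamed from SSD rather than held resident.
```

Which is to say: the interesting number isn't the parameter count, it's that 5.7 tokens/second survives the SSD round-trips.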
The broader Qwen3.5-397B-on-local-hardware story is almost its own beat today. Multiple LocalLLaMA threads are benchmarking it on VRAM+RAM pooled configurations, discussing hybrid GPU/system RAM offloading via llama.cpp, and comparing notes on realistic token speeds. The community is doing the unglamorous work of finding out what this thing actually does outside the demo environment. That's the right instinct.
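The offloading arithmetic those threads are doing by hand is simple to sketch. llama.cpp lets you pin some number of transformer layers in VRAM (the `-ngl` / `--n-gpu-layers` flag) and run the rest from system RAM; the layer count people are comparing notes on comes down to division with headroom. This is an illustrative sketch of that arithmetic, not llama.cpp's internal logic, and all the numbers are hypothetical:

```python
# Sketch of the layer-split arithmetic behind hybrid GPU/system RAM
# offloading. In llama.cpp the split is user-chosen via --n-gpu-layers;
# this just estimates a sensible value. All figures are illustrative.
def layers_on_gpu(vram_gb: float, n_layers: int, model_gb: float,
                  reserve_gb: float = 2.0) -> int:
    """How many transformer layers fit in VRAM, keeping some headroom
    for the KV cache and compute buffers."""
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. a 24 GB GPU, a ~200 GB quantized model with 94 layers:
print(layers_on_gpu(24, 94, 200))  # -> 10 layers on GPU, 84 from RAM
```

The token speeds in those threads are mostly a function of where that split lands, which is why the benchmarking work matters more than the headline.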
Also worth your attention: someone shipped a confidence scoring layer for their open-source local memory system so the model can say "I don't know" instead of hallucinating a confident answer. Vector stores always returning *something* is one of those quiet infrastructure problems that causes enormous real-world damage, and the fact that someone fixed it locally, without a cloud account, using SQLite and FAISS, is the kind of unglamorous craft that actually moves the field. And KoboldCpp just hit its three-year anniversary with native music generation and voice cloning added — three years of consistent shipping is worth more than most company roadmaps I've seen.
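The core of the confidence-scoring fix is small enough to show in full: gate the nearest-neighbor result on a similarity threshold instead of returning it unconditionally. This is a minimal pure-Python sketch of the idea, not the project's actual code (which uses FAISS and SQLite), and the threshold and cosine metric are my assumptions:

```python
# Minimal sketch of confidence-gated retrieval: return the best match
# only if it clears a similarity threshold, otherwise return None
# (the "I don't know" path). Threshold and metric are assumptions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, store, threshold=0.75):
    """Best-matching memory key, or None if nothing clears the bar."""
    best_key, best_score = None, -1.0
    for key, vec in store.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_key, best_score = key, score
    return best_key if best_score >= threshold else None

store = {"paris": [1.0, 0.0], "tokyo": [0.0, 1.0]}
print(retrieve([0.9, 0.1], store))  # strong match -> "paris"
print(retrieve([0.7, 0.7], store))  # ambiguous   -> None
```

That second return value is the whole point: a `None` the model can translate into "I don't know" instead of confidently summarizing the wrong memory.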
The TDAD paper on test-driven agentic development is genuinely interesting if you're building with coding agents. The finding that agents optimize for resolution rate while quietly introducing regressions is not surprising, but having a structured framework for catching it via graph-based impact analysis is useful. Anyone deploying AI coding agents in production without something like this is making a bet they probably haven't fully priced.
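The impact-analysis idea is worth sketching, because it's cheap to build a toy version of: walk a dependency graph from the function the agent touched and collect every test transitively downstream, so the "fix" gets checked against more than the one failing test it targeted. The graph shape and names below are invented for illustration; the paper's actual framework is more elaborate:

```python
# Toy graph-based impact analysis: BFS over reverse dependencies to find
# every test transitively affected by a changed function. Names and the
# graph itself are invented for illustration.
from collections import deque

def impacted(graph: dict, changed: str) -> set:
    """Everything that transitively depends on `changed`."""
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

# edges: function -> things that depend on it
graph = {
    "parse_date": ["format_invoice", "test_parse_date"],
    "format_invoice": ["test_format_invoice", "send_invoice"],
    "send_invoice": ["test_send_invoice"],
}
hits = impacted(graph, "parse_date")
print(sorted(h for h in hits if h.startswith("test_")))
# the agent's change to parse_date must pass all three tests, not one
```

Even this crude version catches the failure mode the paper documents: an agent that makes `test_parse_date` green while silently breaking `send_invoice`.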
The arXiv cluster today is otherwise heavy with benchmarks, compression studies, and safety evaluations — IndicSafe is doing real work for 1.2 billion people whose languages are routinely treated as an afterthought, which matters, even if benchmark papers rarely change behavior on their own.
Here's what's actually true today: the infrastructure people are ahead of the narrative people. While the labs are writing about what their models might eventually do, someone in LocalLLaMA has already run it on their laptop and told you the exact token speed. Trust the second person more.