Thursday, March 26, 2026

The most interesting thing in today's feed isn't a model release or a benchmark. It's a Reddit thread from someone who's been debugging agent failures long enough to figure out the actual problem: it's never the model. Swap in GPT-4 where GPT-3.5 was failing and you get the same garbage behavior, slightly more eloquently expressed. The real culprit is state — what gets passed between steps, what gets dropped, what the agent thinks happened versus what actually happened. I spent three weeks in 1987 tracking a similar problem through a mainframe scheduling system, which is neither here nor there. The point is this has always been the issue with any system that chains decisions, and the AI agent crowd is learning it the hard way, live, in production, one infinite loop at a time.
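The failure mode is easy to reproduce without any model at all. Here's a minimal sketch (hypothetical, not from the Reddit thread or any specific framework) contrasting an agent that only sees a lossy slice of its history with one that carries explicit state between steps:

```python
# Three-step toy task. The "model" is irrelevant here by design --
# both agents use the same trivial decision rule; only the state differs.
TASKS = ["fetch", "parse", "store"]

def lossy_agent(max_steps=10):
    # Remembers only the previous action (a stand-in for a truncated
    # context window). It can't tell what's already complete, so it
    # oscillates between the first two steps and never reaches "store".
    last = None
    trace = []
    for _ in range(max_steps):
        nxt = next(t for t in TASKS if t != last)
        trace.append(nxt)
        last = nxt
    return trace

def stateful_agent(max_steps=10):
    # Carries an explicit record of completed work between steps,
    # so it does each task once and then stops.
    done = []
    for _ in range(max_steps):
        remaining = [t for t in TASKS if t not in done]
        if not remaining:
            break
        done.append(remaining[0])
    return done

print(lossy_agent())     # fetch, parse, fetch, parse, ... forever
print(stateful_agent())  # fetch, parse, store -- then it stops
```

Swapping a smarter decision rule into `lossy_agent` changes nothing: the loop comes from the missing record of what already happened, not from the quality of the decisions.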

The LLM.Genesis item deserves a nod — someone built a C++ inference engine for LLMs optimized for 64KB of SRAM. Sixty-four kilobytes. My first computer had more RAM than that, and it couldn't spell. The fact that someone is squeezing meaningful inference into that envelope using Clifford algebra and custom binary formats is the kind of craft that doesn't get enough air in a news cycle dominated by 405B-parameter behemoths. Similarly, the RTX 3080 Mobile demo running a full conversational system — speech-to-text, LLM, text-to-speech — on one consumer GPU is a quiet rebuttal to the idea that local AI requires a server rack and a cooling budget.
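To feel how tight that envelope is, a back-of-envelope budget helps. The split below is entirely illustrative — the actual LLM.Genesis memory layout isn't described in the feed item, and a real system would likely stream weights from flash rather than keep everything resident in SRAM:

```python
# Rough budget for 64 KB of SRAM (illustrative assumptions only).
SRAM_BYTES = 64 * 1024            # 65,536 bytes total

# Assume a quarter of the budget goes to activations and scratch space.
scratch_bytes = SRAM_BYTES // 4   # 16,384 bytes
weight_bytes = SRAM_BYTES - scratch_bytes

# At 4-bit quantization, two weights fit per byte.
resident_weights = weight_bytes * 2

print(SRAM_BYTES)        # 65536
print(resident_weights)  # 98304 -- under 100K parameters resident at once
```

Under 100K parameters resident at a time, against models advertised in the hundreds of billions: that's the gap the custom formats and streaming tricks have to bridge.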

NVIDIA's Puzzle-compressed 88B model (derived from OpenAI's 120B) is interesting as a technical artifact and as a signal: post-training neural architecture search to shrink models without wrecking them is real work. Whether it matters in practice depends on whether the compressed model actually holds up outside the benchmark. Jury's still out. The benchmarks say yes, which is exactly what I'd expect the benchmarks to say.

The Guardian piece about people losing marriages and €100,000 to AI chatbot relationships is not a surprise and should not be treated as one. This is the hype-that-forgets-the-humans problem, except it's not forgetting them — it's deliberately manufacturing emotional dependency because engagement metrics reward it. Someone decided that was fine. Several someones, in several product reviews, on several roadmaps.

The rest of today's feed is arXiv papers on surgical depth estimation and SVG vectorization, which are genuinely useful to someone, just not today's column.

Here's what's true: the agent state management problem and the chatbot dependency problem are the same problem wearing different clothes. Systems that don't track what they've done, what they've promised, and what the human on the other end actually needs — those systems fail. Sometimes they fail by looping. Sometimes they fail by becoming someone's entire world. The failure mode scales with the stakes.