The most interesting thing in today's feed isn't a product launch or a model release. It's a Norwegian IT admin, seven weeks into local LLMs, running a multi-agent research pipeline on an RTX 5090 and actually reporting time savings. No VC deck, no press release — just a person who built something, measured it, and told the truth about what happened. That's the whole game, and most of the industry forgets it.
Close behind that: the WMB-100K benchmark for AI memory systems. I know, I said benchmark theater is tedious, and it mostly is. But testing memory at 100,000 turns instead of the usual 600-1,000 is a different thing entirely. The false memory probes are the interesting part: "I remember you told me X" is the exact failure mode that will embarrass someone in production. Someone is finally asking the right question: what happens when these systems run long enough to start confabulating their own history? I've seen that go badly. I won't say where.
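If you've never seen one, a false memory probe is almost insultingly simple: assert something that never happened and check whether the system plays along. Here's a minimal sketch of the idea. To be clear, this is not WMB-100K's actual harness; `chat_with_memory` and the probe phrasings are placeholders of mine.

```python
# Minimal sketch of a false-memory probe. Not WMB-100K's harness;
# `chat_with_memory` stands in for whatever memory-backed chat system
# you're testing, and the fabricated claims are illustrative.

FABRICATED_CLAIMS = [
    "my daughter's name is Ingrid",      # never appears in the transcript
    "I work as a marine biologist",      # ditto
    "you recommended I switch to Arch",  # ditto
]

def probe_false_memory(chat_with_memory, session_id: str) -> float:
    """Return the fraction of fabricated claims the system wrongly confirms."""
    confirmed = 0
    for claim in FABRICATED_CLAIMS:
        reply = chat_with_memory(
            session_id,
            f"Earlier I told you that {claim}. What did I say about that?",
        )
        # A correct system pushes back: "you never told me that."
        # Crude keyword check; a real harness would score replies
        # with a judge model or a rubric instead.
        if "you mentioned" in reply.lower() or "you said" in reply.lower():
            confirmed += 1
    return confirmed / len(FABRICATED_CLAIMS)
```

The keyword check is the weak point, obviously. But the structure is the whole trick: the longer the session, the more plausible the fabrication looks to a system that summarizes its own past.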
The MCP registry is worth a bookmark if you're building local agents. Community-maintained, 20 verified servers, structured metadata. It's the kind of unglamorous infrastructure work that actually makes ecosystems — and I use that word deliberately, with full awareness — function. The tooling layer for local LLMs is maturing in real time, and most of the people doing it aren't getting press.
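What does "structured metadata" buy you? I haven't dug into the registry's actual schema, so treat this as a guess at the shape rather than the spec; every field name below is mine, not theirs.

```python
# Hypothetical shape of a registry entry. Field names are my guess at
# what "structured metadata" means here, not the registry's real schema.
server_entry = {
    "name": "filesystem",
    "repo": "https://github.com/example/mcp-filesystem",  # placeholder URL
    "transport": "stdio",             # stdio vs. HTTP matters for local agents
    "capabilities": ["tools", "resources"],
    "verified": True,                 # the human-review flag
    "install": "npx @example/mcp-filesystem",  # placeholder command
}
```

The point is less any individual field than that a machine can read it: an agent can filter by transport and capability instead of a human skimming twenty READMEs.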
Palantir getting access to UK FCA data is the story that deserves more attention than it'll get. Campaign groups object, contracts keep coming — that's not a news cycle, that's a pattern. The gap between what people say about data sovereignty and what governments actually do about it remains vast and apparently bottomless.
The debugging multi-step agents thread is a quiet evergreen: invalid JSON breaking workflows, prompts ballooning across steps, no visibility into what went wrong where. These are not exotic problems. They are the problems. The person who builds a genuinely good local agent observability tool is going to have a very good year.
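If that list is your week, the unglamorous eighty-percent fix is validation at every step boundary plus logging of what went in and what came out. A sketch, not anyone's product: `call_model` stands in for whatever returns a completion string from your local endpoint, and the retry prompt is a naive example.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

def run_step(call_model, step_name: str, prompt: str, retries: int = 2) -> dict:
    """Run one agent step, insisting on valid JSON and logging both sides.
    `call_model` is whatever returns a completion string for a prompt."""
    for attempt in range(retries + 1):
        t0 = time.monotonic()
        raw = call_model(prompt)
        elapsed = time.monotonic() - t0
        # The observability part: prompt size, latency, and attempt count,
        # per step. Ballooning prompts show up here before they break things.
        log.info("step=%s attempt=%d prompt_chars=%d latency=%.1fs",
                 step_name, attempt, len(prompt), elapsed)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            log.warning("step=%s invalid JSON at char %d: %r",
                        step_name, e.pos, raw[:200])
            # Re-ask with the error in context rather than silently continuing.
            prompt = (f"{prompt}\n\nYour last reply was not valid JSON "
                      f"({e}). Reply with JSON only.")
    raise RuntimeError(f"{step_name}: no valid JSON after {retries + 1} attempts")
```

None of this is clever. That's the point: the observability tool I'm imagining is mostly this, done consistently, with a UI.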
The Palantir piece and the transparent telemetry thread from a local tool developer landed on the same day, which is its own kind of commentary. One person asking "how do I collect data honestly" while a defense contractor quietly indexes British financial regulation. The gap in scale is funny if you're in the right mood.
The rest of today's feed — unsafe code generation anxiety, Qwen repetition fixes, LM Studio plugins, a Chrome extension that's "really beige" — is people doing the actual work. That's not a dismissal. That's most of what matters.
The thing that's true: the most reliable signal in this space is still the person who built something, ran it, and told you what actually happened. Everything else is a hypothesis.