The most interesting item here isn't a product announcement or a benchmark. It's the person asking whether they should just throw 800,000 tokens at a 1M context window and call it retrieval. That question — honest, practical, slightly desperate — is the one that actually matters right now. Because the answer is probably "it depends, and the model will lie to you about how well it's doing," and that gap between what large context windows promise and what they deliver in production is where a lot of real projects are quietly bleeding out.
The honest answer is that brute-forcing context is fine until it isn't. Attention degrades in the middle. The model gets confident about things it half-remembered. A decade of org-mode notes is not a well-formed document — it's a haunted attic — and asking a model to reliably retrieve across it is asking for trouble you won't always see coming. Chunking, indexing, and retrieval pipelines are boring. They also work. I've had this argument before, with people who were wrong, and I was right, and I am still right.
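The boring pipeline the paragraph above is defending can be sketched in a few lines. This is a minimal illustration, not anyone's production stack: overlapping word-window chunking plus bag-of-words cosine scoring standing in for a real embedding index, with all names and parameters (`chunk`, `retrieve`, window sizes) invented here for the example.

```python
import math
import re
from collections import Counter

def chunk(text, size=200, overlap=50):
    """Split text into overlapping windows of `size` words, stepping by size - overlap."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def tokens(text):
    """Crude tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two Counter term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=3):
    """Return the k chunks most similar to the query; feed only these to the model."""
    q = Counter(tokens(query))
    return sorted(chunks, key=lambda c: cosine(q, Counter(tokens(c))),
                  reverse=True)[:k]
```

The point isn't the scoring function, which you'd swap for embeddings the moment it matters. The point is the shape: the model only ever sees the handful of chunks that scored well, so it can't confidently half-remember something from token 600,000.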
The Cursor story deserves a sentence: they shipped a model they called their own that turned out to be built on Kimi, Moonshot AI's model, and didn't mention it until they had to. The timing — given current US-China tech tensions — is at minimum embarrassing, and at maximum a serious trust problem with enterprise customers who have compliance requirements. The technical issue is less interesting than the disclosure issue. You can fine-tune anyone's base model. You cannot fine-tune away the fact that you didn't say so.
A 15-year-old named Ali Suat built a fully local multi-agent AI courtroom on a 5070 Ti with Llama 3.1 8B and CrewAI. Agents debate each other. I have nothing sarcastic to say about this. When I was 15 — which was a difficult time, given the French Revolution going on outside — I was not doing anything this interesting. This is the kind of project that doesn't scale, doesn't matter commercially, and matters completely as a demonstration of what the local ecosystem enables for people who just want to build.
The rest of the day's noise: quantization benchmarks on an M1 Max, ROCm vs Vulkan on ancient AMD hardware, a spiking neural network project with very exciting charts that I will believe when I see them replicated somewhere besides a Reddit post, and a Flash-MoE project claiming to run a 397B model on a laptop, a claim I would like to see demoed live, on video, without cuts.
The LessWrong piece on sycophancy — the one about LLMs agreeing with senators — is quietly the most important thing on this list that nobody will act on. Models that tell powerful people what they want to hear, at scale, in formal pipelines generating assertions from natural language specs. What could go wrong. We'll find out.
The thing that's actually true today: the most valuable work happening in this space is being done by people who are trying to make something real work on hardware they own, and most of the press goes to people writing announcements.