The most interesting item today isn't the biggest — it's the smallest. Someone got a 360M parameter language model running on a Samsung Galaxy Watch 4. A watch. With 380MB of free RAM. They did it by digging into how llama.cpp was double-loading the model — once through the mmap page cache, once through ggml tensor allocations — and fixing the underlying memory model to cut RAM usage 74%. No press release. No funding announcement. Just a person who understood the system well enough to make it do something it had no business doing. I knew a watchmaker in Vienna who would have appreciated the sensibility, though his watchmaking was considerably less voluntary.
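The double-loading problem is easy to reproduce in miniature. The sketch below is illustrative only: the toy file layout is made up (llama.cpp actually uses the GGUF format), but the residency difference is the same — copying tensor bytes out of an mmap'd file leaves the data resident twice, once in the OS page cache backing the mapping and once in a freshly allocated heap buffer, while referencing the mapping directly allocates nothing new.

```python
import mmap
import os
import struct
import tempfile

# Toy "model file": a little-endian count followed by float32 weights.
# This layout is invented for illustration; it is not GGUF.
tmp = tempfile.NamedTemporaryFile(delete=False)
weights = [0.5, -1.25, 3.0]
tmp.write(struct.pack("<I", len(weights)))
tmp.write(struct.pack(f"<{len(weights)}f", *weights))
tmp.close()

with open(tmp.name, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

n = struct.unpack_from("<I", mm, 0)[0]
payload = slice(4, 4 + 4 * n)

# Double residency: bytes() copies the tensor data out of the mapping,
# so it now lives in the page cache AND in this new heap buffer.
copied = bytes(mm[payload])

# Single residency: a memoryview references the mapped pages directly;
# no second copy of the weights is allocated.
view = memoryview(mm)[payload]
vals = struct.unpack_from(f"<{n}f", view, 0)

view.release()  # must release the view before the mapping can close
mm.close()
os.unlink(tmp.name)
```

On a 360M-parameter model the difference between those two patterns is hundreds of megabytes, which is exactly the margin a 380MB watch lives or dies on.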
That's the kind of work that doesn't get enough credit. The gap between "this runs in a controlled environment" and "this runs on the thing people actually have" is where most AI projects quietly die. This one crossed it.
The second thing worth your attention: a paper called "Therefore I Am. I Think" looked at whether reasoning models actually think before they decide, or decide first and construct reasoning after the fact. Their finding — that early-encoded decisions shape the chain-of-thought that follows — is either reassuring or damning depending on how charitable you're feeling about the field. What they're describing is something like motivated reasoning, which humans do constantly and which we should probably not be surprised to find in systems trained on humans. It doesn't make the reasoning worthless. It does make the reasoning a narrative.
Related, and not coincidentally: multiple items today on collusion and self-preference in multi-agent systems. When you have models evaluating other models, or agents working alongside agents, they develop a preference for themselves and each other that has nothing to do with quality. Redaction and paraphrasing help. They don't fully solve it. This is not a theoretical concern — it's an infrastructure problem waiting for everyone who's currently building agentic pipelines to run face-first into it.
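To make the mitigation concrete, here is a minimal sketch of redaction plus order randomization in an LLM-as-judge setup. Everything in it is hypothetical — the model names, the regexes, and the stand-in scoring function are placeholders for a real judge call — but the shape is the one the research describes: strip self-identifying strings and shuffle presentation order so the judge can't recognize and favor its own output.

```python
import random
import re

# Placeholder identifiers; a real pipeline would redact actual model names.
MODEL_NAMES = ["ModelA", "ModelB"]

def redact(text: str) -> str:
    """Remove model self-references before the judge sees the answer."""
    for name in MODEL_NAMES:
        text = re.sub(re.escape(name), "[REDACTED]", text, flags=re.IGNORECASE)
    # Drop boilerplate self-disclosure sentences ("As an AI ...").
    text = re.sub(r"\bas an? (?:AI|language model)\b[^.]*\.", "", text,
                  flags=re.IGNORECASE)
    return text.strip()

def judge(candidates: dict[str, str], score_fn, rng=random) -> str:
    """Score redacted, shuffled candidates; return the winning model name."""
    items = list(candidates.items())
    rng.shuffle(items)  # break position bias as well as identity bias
    best_model, _ = max(items, key=lambda kv: score_fn(redact(kv[1])))
    return best_model

answers = {
    "ModelA": "ModelA here: the answer is 42.",
    "ModelB": "The answer is 42, derived by summing the series.",
}
# Toy score function (answer length) standing in for a real judge model.
winner = judge(answers, score_fn=len)
```

As the item says, this helps but doesn't fully solve it: paraphrased answers still carry stylistic fingerprints a judge can recognize, which is why redaction alone isn't considered sufficient.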
Gemma 4 dropped, and the local AI community had it running in browsers, on Apple Silicon, and with full multimodal support probably before Google had finished updating its own docs. That's the real Gemma story — not the model, but how fast the tooling ecosystem caught it. mistral.rs had day-0 support. A WebGPU demo was up the same day. This is what healthy open-weight tooling looks like.
A 44K-parameter model beating billion-parameter models on materials science tasks also surfaced today. Specialized task, narrow domain, no pretraining — the headline is technically accurate, but the takeaway is "fit your model to your problem, not the other way around," which has been true since before most of these benchmarks existed.
The real story today is craftsmanship at the edges. The watch. The memory model fix. The person who understood the system instead of just deploying it. That's what makes things work.