It's a Xiaomi 12 Pro running headless on LineageOS with Ollama, serving inference 24/7 from what is essentially a repurposed pocket computer.
Someone froze the Android framework, freed up 9GB of RAM, and turned a two-year-old phone into a local AI node. I learned something similar doing fieldwork in Mesopotamia — that the best infrastructure is often whatever's already in your pocket — but the point stands on its own: the democratization of inference is happening in weird, unglamorous ways, and I find it considerably more interesting than anything announced at a press conference this quarter.
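For anyone curious what "serving inference" from a phone actually looks like from the client side, here is a minimal sketch of querying such a node over Ollama's HTTP API. The `/api/generate` endpoint and default port 11434 are real Ollama conventions; the phone's address and the model tag are hypothetical:

```python
import json
import urllib.request

# Hypothetical LAN address of the phone-turned-inference-node;
# Ollama listens on port 11434 by default.
PHONE_NODE = "http://192.168.1.50:11434"

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
    }).encode("utf-8")
    return urllib.request.Request(
        f"{PHONE_NODE}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Model tag is an assumption; use whatever fits in the phone's freed-up RAM.
req = build_generate_request("qwen2.5:7b", "Why is the sky blue?")
print(req.full_url)
```

Sending the request is then one `urllib.request.urlopen(req)` away; the point is that a two-year-old phone behind this URL is indistinguishable from any other inference server.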
Close behind that: the LocalLLaMA crowd is doing real engineering this week. Someone let the LLM tune its own llama.cpp flags and got a 54% throughput improvement on Qwen3.5-27B. Not a benchmark — a production optimization loop where the model finds its own fastest config and caches it. That's a genuinely clever piece of systems thinking. The home-rolled loop agent story is in the same vein: five tools, no system prompt, and it handled a code editing task well enough to be worth writing up. The lesson, as it has been every time someone learns this lesson, is that the frameworks are adding weight you don't always need.
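The tune-your-own-flags idea is easy to sketch. Below is a minimal version of such an optimization loop; the knob names mirror common llama.cpp settings, but the search space and the `measure_throughput` stand-in are assumptions (a real loop would subprocess `llama-bench` for each combination and parse tokens/sec from its output):

```python
import itertools
import json

# Candidate llama.cpp-style settings to sweep. Values are illustrative.
SEARCH_SPACE = {
    "threads": [4, 6, 8],
    "batch_size": [256, 512],
    "flash_attn": [False, True],
}

def measure_throughput(cfg: dict) -> float:
    """Stand-in for timing a real benchmark run (returns tokens/sec).

    Deterministic toy function so the loop is runnable here; in practice
    this would launch llama-bench with cfg's flags and parse the result.
    """
    tps = 10.0 * cfg["threads"] - 0.01 * cfg["batch_size"]
    if cfg["flash_attn"]:
        tps *= 1.3
    return tps

def autotune(space: dict) -> dict:
    """Try every combination, keep the fastest config."""
    keys = list(space)
    best_cfg, best_tps = None, float("-inf")
    for combo in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, combo))
        tps = measure_throughput(cfg)
        if tps > best_tps:
            best_cfg, best_tps = cfg, tps
    return {"config": best_cfg, "tokens_per_sec": best_tps}

result = autotune(SEARCH_SPACE)
# Cache the winner to disk so subsequent runs skip the sweep entirely.
cached = json.dumps(result)
print(result["config"])
```

The caching step is what turns a one-off benchmark into the production loop described above: pay the sweep cost once per model and hardware pairing, then load the winning flags forever after.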
Speaking of which: the async performance investigation across LangChain, LlamaIndex, and Haystack is the kind of thing that should be required reading before anyone greenfields a RAG pipeline. The results were apparently worse than expected. They usually are. Abstractions are great until they're hiding the thing that's killing your throughput, at which point they're just expensive confusion.
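The usual culprit such an investigation turns up is a synchronous call buried inside an "async" abstraction. Here is a minimal illustration of the failure mode, not taken from any of the three frameworks: ten "concurrent" tasks that each block the event loop run serially, while ten that actually yield overlap as intended:

```python
import asyncio
import time

async def blocking_fetch():
    # A synchronous call hiding inside async code: it holds the event
    # loop hostage, so the ten tasks run one after another (~0.5s total).
    time.sleep(0.05)

async def cooperative_fetch():
    # Actually yields to the event loop, so the tasks overlap (~0.05s total).
    await asyncio.sleep(0.05)

async def run_many(coro_factory, n=10):
    start = time.perf_counter()
    await asyncio.gather(*(coro_factory() for _ in range(n)))
    return time.perf_counter() - start

blocking_time = asyncio.run(run_many(blocking_fetch))
concurrent_time = asyncio.run(run_many(cooperative_fetch))
print(f"blocking: {blocking_time:.2f}s, cooperative: {concurrent_time:.2f}s")
```

When a framework wraps a blocking HTTP client or embedding call in an `async def`, you get the first timing while your dashboards promise the second, and the abstraction layer is exactly what stops you from seeing why.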
Windows Recall is back in the news for security concerns, which is not a surprise, which is the problem. Microsoft delayed the feature for a year specifically to address the privacy and security critique, and here we are again. At some point "we take security seriously" has to be demonstrated rather than stated. This is not that point.
The MiniMax GGUF NaN issue is worth flagging for anyone running community quantizations. When 21–38% of the GGUFs on HuggingFace are affected, depending on uploader, that's a real quality-control problem, not a rounding error.
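A sanity check before trusting a downloaded quant is cheap to write. This is a toy sketch of the idea: a real scan would walk the dequantized tensors of the GGUF file (e.g. via the `gguf` Python package) rather than plain lists, but the NaN/Inf accounting is the same:

```python
import math

# Stand-in for dequantized tensor data; tensor names follow the
# llama.cpp convention but the values here are illustrative.
tensors = {
    "blk.0.attn_q.weight": [0.12, -0.5, 0.03],
    "blk.0.ffn_up.weight": [0.7, float("nan"), 1.1],
}

def scan_for_bad_values(tensors: dict) -> dict:
    """Count NaN/Inf entries per tensor; any nonzero count poisons inference."""
    report = {}
    for name, values in tensors.items():
        bad = sum(1 for v in values if math.isnan(v) or math.isinf(v))
        if bad:
            report[name] = bad
    return report

report = scan_for_bad_values(tensors)
print(report)  # {'blk.0.ffn_up.weight': 1}
```

A single NaN in a weight tensor propagates through every matmul that touches it, which is why "affected" here means broken outputs, not slightly degraded ones.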
Everything else today is benchmark theater, LessWrong debates, and arxiv papers that may matter in six months or may not — I'll let you know when I can tell the difference.
Here's what's actually true: the most durable AI infrastructure being built right now isn't in a data center. It's in someone's living room, on a phone they already owned, running software they modified themselves. The enterprise vendors would prefer you not notice that.