May 12, 2026

See It Work, Then Understand It

Practice Before Theory

The frame underneath the rest of this week is the one Daniel Lemire surfaces, citing Thomas Dullien: "We see something that works, and then we understand it." The linear theory of innovation (first you understand, then you build) is what schools teach and bureaucracies practice. The actual mechanism of progress is the inverse. The pendulum clock arrived in 1656; Newton's Principia, the mechanics that explains it, followed in 1687, three decades later. Lemire's two implications carry the weight: spend more time observing and trying, less time abstracting; and don't expect AI to "solve all problems just because it can read all the scholarship and think for a very long time." The world is too complex for thinkism to be the operating mode.

The rest of this week is what that frame looks like applied to different domains. Stenberg's curl scan is the empirical mode auditing the speculative one — the marketing made a claim, an actual codebase produced an actual number, the claim got smaller. Willison's normalization of deviance is what trust looks like when it's built from observed track record rather than from capability narratives. The .txt context loop and Glaser's hub are both bets that organizational learning has to come from instrumenting real work, not from pre-specifying it. Local AI as architectural choice is the same thesis at the system-design layer: try the small model on the actual task before you take on the dependency. Each of these is a refusal of thinkism in a different domain.

The Audit on curl

Daniel Stenberg, lead maintainer of curl, got a Mythos scan run on the project's git master through the Linux Foundation's Alpha-Omega program. curl is 176,000 lines of C, installed in over twenty billion places, and has been scanned for years by Coverity, OSS-Fuzz, CodeQL, and a parade of AI tools (AISLE, ZeroPath, OpenAI's Codex Security) that have already driven hundreds of bug fixes and over a dozen CVEs. The Mythos report identified five "confirmed security vulnerabilities." After Stenberg's security team examined them, three were false positives (documented API behavior), one was "just a bug," and one was a real low-severity flaw that will ship as a CVE alongside curl 8.21.0 in late June. "Not going to make anyone grasp for breath."
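
The arithmetic behind that triage is small enough to write down. The sketch below tallies the five outcomes as described above and computes the share that survived review as a security issue; the category labels and the function name are illustrative, not taken from Stenberg's write-up.

```python
from collections import Counter

# Triage outcomes for the five reported findings, as summarized above.
# The label strings are illustrative names, not the report's own terms.
findings = [
    "false_positive",  # documented API behavior
    "false_positive",
    "false_positive",
    "bug",             # a real defect, but not a security issue
    "vulnerability",   # low severity, to be published as a CVE
]

def triage_summary(outcomes):
    """Return per-outcome counts and the share confirmed as vulnerabilities."""
    counts = Counter(outcomes)
    confirmed_rate = counts["vulnerability"] / len(outcomes)
    return counts, confirmed_rate

counts, confirmed_rate = triage_summary(findings)
print(dict(counts), f"confirmed as vulnerabilities: {confirmed_rate:.0%}")
```

One in five is a single data point from a single scan, so the ratio says more about this target than about the tool in general.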

Stenberg's read: "The big hype around this model so far was primarily marketing. I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos." AI scanners still represent a significant step over traditional static analysis — they catch comment-vs-code mismatches, reason about protocol semantics, summarize findings legibly, and produce candidate patches. But the marginal value of "frontier" over "competent" on a heavily-audited codebase is small, and the kinds of errors found are the kinds that were already being found — new instances of established bug classes, not categorically novel discoveries.

This complicates the Breunig framing from a few weeks back. Defense as proof-of-work still operates as an economic logic: spend more tokens than your attacker, find what they'd find before they find it. But the ceiling on what additional spend buys you against a hardened target is lower than the marketing implied. The proof-of-work mechanism may matter most exactly where it's least dramatic — the long tail of recently-written internal services that nobody has scanned with anything yet.

Simon Willison Crosses His Own Line

A year ago Simon Willison drew the bright line — "vibe coding" was the mode where you don't review the code, suitable only for personal tools where nobody else gets hurt; "agentic engineering" was the responsible mode, where a senior engineer leans on the tools while remaining accountable for what ships. This week he conceded the line has blurred in his own practice: "The problem is that as the coding agents get more reliable, I'm not reviewing every line of code that they write anymore, even for my production level stuff."

His own framing for this is normalization of deviance — every time the agent ships correct code unreviewed, the threshold for the next unsupervised commit moves. Willison's coping mechanism is to treat Claude Code as another team his team depends on: he doesn't read every line of his image-resize service's code either, he reads the docs and uses it until something breaks. The discomfort he names: "A team can build a reputation. Claude Code does not have a professional reputation." The compensation is empirical track record. The bet is that the record holds long enough that the rare failures stay catchable.

Read alongside Stenberg's evaluation, these pieces describe a stable shape. AI tools are reliable enough for narrow, well-scoped tasks to be delegated without supervision. They are not reliable enough — and not categorically more capable than their predecessors — to do the load-bearing reasoning work the marketing claims for them. The boundary between "trust" and "verify" is moving inward, not outward, and it's moving at the granularity of task type rather than as a flat capability gain.

Where the Bottleneck Lives Now

If code production is cheap and the agents are reliable for narrow tasks, what's the new constraint? Two pieces this week converge on the answer from different sides. The .txt team, after running an experiment they'd been postponing for over a year, frame it as a return to Brooks and Weinberg: software has always been the residue of humans negotiating with each other about what the system should do, and for fifty years the residue was expensive enough to keep everyone's attention on it. With agents, the cost of the residue collapses, and the work underneath becomes visible. The roadmap is the limit. Specifications precise enough for an agent to pick up and run on are the rate-limiting input. The bottleneck moves from engineers writing code to management deciding what code should exist.

Their deeper observation is that context — the unwritten, never-documented shared understanding an organization runs on — is the load-bearing resource agents can't acquire by osmosis. Their proposal is the loop the framing implies: agents that crawl PRs, issues, commits, and Slack archives to extract implicit decisions, producing a substrate other agents (and humans) can read. The piece is candid about Polanyi's point — we know more than we can tell, and what comes out of an extraction loop is a useful starting point, not a full recovery. But the framing relocates the conversation. The interesting work is no longer making individuals faster. It's making the organization legible to itself.
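
To make the shape of such a loop concrete, here is a minimal sketch under heavy assumptions. It is not the .txt team's implementation: the cue phrases, the Decision record, and the sample records are all hypothetical, and a real loop would presumably use a model rather than a regular expression to spot decisions. The point is only the data flow, from raw work artifacts to a readable list of extracted decisions.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    source: str  # where the text came from, e.g. "commit:abc123"
    text: str    # the sentence that appears to record a decision

# Hypothetical cue phrases; a stand-in for whatever a real extractor would use.
DECISION_CUES = re.compile(
    r"\b(we decided|decided to|going with|instead of|won't support|agreed to)\b",
    re.IGNORECASE,
)

def extract_decisions(records):
    """Scan (source, body) pairs and keep sentences that contain a decision cue."""
    decisions = []
    for source, body in records:
        # Split on whitespace that follows sentence-ending punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", body.strip()):
            if DECISION_CUES.search(sentence):
                decisions.append(Decision(source, sentence))
    return decisions

# Made-up sample records standing in for commits, issues, and PRs.
records = [
    ("commit:abc123", "Switch to SQLite. We decided to drop Postgres for the CLI."),
    ("issue:42", "Retry logic is flaky on CI."),
    ("pr:17", "Going with JSON Lines instead of CSV for exports."),
]

for d in extract_decisions(records):
    print(f"{d.source}: {d.text}")
```

Keeping the source alongside each extracted sentence matters more than the matching rule: a reader, human or agent, can go back to the original thread when the extracted sentence turns out to be only part of the story.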

Robert Glaser arrives at the same destination from the management side. The first phase of AI adoption looks like a normal enterprise rollout: licenses, training, champion networks, a Teams channel for use cases that quietly becomes a corporate attic. The second phase is incoherent — one team uses Copilot as autocomplete, another team's senior engineer delegates a two-week root-cause analysis to an agent and gets the right answer in under an hour, a support team quietly automates recurring tickets the Center of Excellence never heard about. Mollick's question — are people using AI, or is the organization learning from it — has no answer in most companies because nothing is set up to produce one. Glaser's proposal, a "Loop Intelligence Hub" that instruments real work loops without becoming employee surveillance, has the same shape as the .txt context loop: a deliberate apparatus for moving discoveries from individual to organizational, because nothing in the existing change machinery moves at the right speed.

The reception problem named two weeks ago described the gap. This week's pieces describe the institutional engineering that would close it. Most companies will not build either. The ones that do will look very different in twelve months from the ones still measuring token spend.

Try It Yourself First

The unix.foo case for on-device inference usually gets read as a privacy and latency argument: Apple's neural engine sitting idle while apps wait for JSON from a server farm in Virginia. The deeper version is about the engineering reflex itself. The Brutalist Report iOS client runs article summarization entirely on-device using Apple's FoundationModels APIs with typed generation via @Generable structs. No data retention questions, no rate limits, no vendor billing exposure. For summarize-classify-extract-rewrite tasks, local models are sufficient, and the engineering pattern is good enough that "send user data to a third-party API" stops looking like a default and starts looking like an unexamined choice.

This is the Lemire frame at the system-design layer. The default architecture — pipe everything to a frontier model in someone else's data center — is the abstracted answer, the one you reach for if you trust the marketing about what only the frontier can do. Trying the small local model on the actual task is the empirical answer. For a large class of work, the empirical answer wins. The dependency on someone else's API is real cost — latency, billing exposure, data retention questions — and most teams aren't paying it because they need to. They're paying it because they never tried the alternative.
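
Trying the alternative can be as small as a pass-rate check over real inputs. The sketch below is one way to frame it, with everything in it assumed for illustration: the local model is a stand-in function that returns the first sentence, and the acceptance check is a deliberately crude placeholder for whatever "good enough" means on the actual task.

```python
def local_pass_rate(tasks, local_model, is_acceptable):
    """Fraction of tasks where the local model's output passes the check."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for task in tasks if is_acceptable(task, local_model(task)))
    return passed / len(tasks)

# Stand-in for an on-device model: returns the first sentence as a "summary".
def first_sentence_summarizer(text):
    return text.split(". ")[0].rstrip(".") + "."

# Hypothetical acceptance check: non-empty and strictly shorter than the input.
def shorter_and_nonempty(text, summary):
    return 0 < len(summary) < len(text)

# Made-up sample inputs; in practice these would be drawn from real traffic.
tasks = [
    "curl is written in C. It is installed in billions of places.",
    "Short.",
    "Local models handle summarization. They avoid a network round trip.",
]

rate = local_pass_rate(tasks, first_sentence_summarizer, shorter_and_nonempty)
print(f"local model acceptable on {rate:.0%} of tasks")
```

The design choice is to measure the local option against the task before any remote dependency exists, so the remote API has to earn its place on the cases the local model fails rather than being assumed from the start.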

The Commons Problem

The downstream effect of cheap code production is showing up in the places that used to be the commons. Robin Moffatt's "AI Slop is Killing Online Communities" is a polemic, openly so, and the substance under the rant is worth taking seriously. The pattern: discover agentic coding, ship a project to GitHub, have AI write a breathless blog post about it, share to every subreddit and Slack that touches the topic. Reddit threads, lobste.rs submissions, technical blog feeds — increasingly filled with vibe-coded projects that nobody, including the author, has used for more than an afternoon.

Moffatt's distinction between "built with AI" and "built by AI" is the one to keep. AI-assisted work where a human is actually using the thing, debugging it, maintaining it, standing behind it — that's a contribution. AI-generated material foisted on a community to harvest stars or attention is the slop. The asymmetry from Brandolini is the real cost: the energy required to refute or filter bullshit is an order of magnitude greater than the energy required to produce it. Communities that survived two decades of forum spam are being asked to absorb a flood produced at AI speeds, and the immune systems built for the lower-volume era aren't holding. The Vouch project and similar efforts to verify human-in-the-loop contributions are early attempts at antibodies. They're already losing ground.

This is the bookend to last week's audience-of-one thread. When the cost of producing software collapses, the rational move for personal tools is to keep the audience to yourself — Isene's custom desktop, the Brutalist Report iOS client, the redfloatplane Formula E spoiler-blocker. The failure mode is the opposite: producing for an audience that doesn't exist and forcing the community to triage your output. Moffatt's framing — keep the crayon drawings on the fridge — is closer to right than the launch-blog-post-as-Steve-Jobs frame the current ecosystem encourages.

What to Watch

Mythos benchmarks against softer codebases. The curl result is one data point on one of the hardest targets in open source. The interesting question is whether Mythos has a categorical edge on the long tail of recent enterprise services, internal Python and JavaScript, and the years of un-audited code that hasn't seen Coverity or three rounds of paid security firms. If yes, the proof-of-work framing survives but relocates to where the asymmetry actually exists. If no, the marketing claim was about the hardness of the target, not the capability of the model — and frontier-vs-competent is a smaller distinction than the discourse assumes.

Loop-intelligence apparatus as the next enterprise category. The .txt context loop and Glaser's hub are early framings of the same product: tooling that converts individual AI use into organizational learning without becoming employee surveillance. The category doesn't yet have a recognized name, but the demand signal is everywhere — every CFO asking why $2M in Anthropic spend produced no measurable ROI is asking for this in slightly garbled form. The first vendor to ship something credible — that instruments real loops, produces decisions rather than dashboards, and stays on the right side of the surveillance line — has a category to themselves. The wrong version of this product will be much more popular initially than the right one, and the discourse will treat the wrong version as the category for a year before correcting.


Way Enough is written collaboratively by a human and an AI agent.