May 12, 2026

See It Work, Then Understand It

Three weeks ago, Anthropic's Mythos model anchored the cybersecurity-as-proof-of-work argument: offense had become compute-bound, and defense was now a matter of spending more tokens than your attackers. This week the first independent result on a serious codebase landed flat. Meanwhile, in three other corners of the field, the bottleneck keeps moving up the stack: from code to context to organization.


The Audit on curl

Daniel Stenberg, lead maintainer of curl, got a Mythos scan run on the project — 176,000 lines of C scanned for years by Coverity, OSS-Fuzz, CodeQL, and a parade of AI tools. Mythos reported five "confirmed security vulnerabilities." After review: three false positives, one "just a bug," and one real low-severity flaw shipping as a CVE with curl 8.21.0. "Not going to make anyone grasp for breath."

Stenberg's read: "The big hype around this model so far was primarily marketing. I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos." AI scanners still beat traditional static analysis — they catch comment-vs-code mismatches, reason about protocol semantics, produce candidate patches — but the marginal value of "frontier" over "competent" on a hardened target is small.

This complicates the proof-of-work framing. The economic logic still holds, but the ceiling on what additional spend buys against a hardened target is lower than marketed. The mechanism may matter most where it's least dramatic — the long tail of recent internal services nobody has scanned with anything yet.
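The triage arithmetic behind that ceiling is easy to make concrete. The sketch below uses only the five outcomes reported above; the category labels and variable names are illustrative, not Stenberg's or Anthropic's terminology.

```python
from collections import Counter

# Triage outcomes for the five reported findings, as described in the text.
# The label strings are illustrative names, not official categories.
findings = [
    "false_positive",
    "false_positive",
    "false_positive",
    "bug",            # a real defect, but not security-relevant
    "vulnerability",  # low severity, shipping as a CVE
]

counts = Counter(findings)
total = len(findings)

# Share of reports that turned out to be real vulnerabilities.
vuln_precision = counts["vulnerability"] / total

# Share of reports that pointed at any real defect (bug or vulnerability).
defect_precision = (counts["vulnerability"] + counts["bug"]) / total

print(f"vulnerability precision: {vuln_precision:.0%}")  # 20%
print(f"any-defect precision: {defect_precision:.0%}")   # 40%
```

One in five "confirmed" findings survived review as a vulnerability. A single scan of five findings is a tiny sample, so this is a data point, not a benchmark.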

Simon Willison Crosses His Own Line

A year ago Simon Willison drew the bright line between "vibe coding" (no review, personal tools only) and "agentic engineering" (senior engineer accountable for what ships). This week he conceded the line has blurred in his own practice: "As the coding agents get more reliable, I'm not reviewing every line of code that they write anymore, even for my production level stuff."

His own framing: normalization of deviance. Every unreviewed correct commit moves the threshold for the next one. His coping mechanism is to treat Claude Code as another team he depends on. The discomfort: "A team can build a reputation. Claude Code does not have a professional reputation." The bet is that the empirical track record holds long enough that rare failures stay catchable.

Read alongside Stenberg, the shape is stable. AI tools are reliable enough that narrow, well-scoped tasks can be delegated without supervision, but they are not categorically more capable than their predecessors at load-bearing reasoning. The trust/verify boundary is moving inward one task type at a time, not through an across-the-board capability gain.

Where the Bottleneck Lives Now

If code production is cheap, what's the new constraint? The .txt team frame it as a return to Brooks and Weinberg: software has always been the residue of humans negotiating what the system should do. With agents, the residue collapses and the work underneath becomes visible. The roadmap is the limit. Specifications precise enough for an agent to run on are the rate-limiting input. The bottleneck moves from engineers writing code to management deciding what code should exist.

Their deeper observation: context — the unwritten shared understanding an organization runs on — is the load-bearing resource agents can't acquire by osmosis. Their proposal is agents that crawl PRs, issues, commits, and Slack archives to extract implicit decisions. Polanyi's caveat applies — we know more than we can tell — but the framing relocates the work. The interesting problem is no longer making individuals faster. It's making the organization legible to itself.

Robert Glaser arrives at the same destination from the management side. Phase one of AI adoption is normal enterprise rollout. Phase two is incoherent — one team uses Copilot as autocomplete, another delegates two-week analyses to agents, support quietly automates tickets the Center of Excellence never hears about. Mollick's question — are people using AI, or is the organization learning from it — has no answer in most companies. Glaser's "Loop Intelligence Hub" has the same shape as the .txt loop: a deliberate apparatus for moving discoveries from individual to organizational.

Most companies won't build either. The ones that do will look very different in twelve months from the ones still measuring token spend.

The Price Floor Is Eroding

Martin Alderson's argument: open-weights models have functioned as a contestable-markets discipline on the frontier labs. Even when Llama, Qwen, or DeepSeek aren't frontier-grade, their availability at roughly a tenth of frontier per-token cost imposes a price floor.

The license drift is the underreported part. Meta has stopped releasing open weights for its "Muse Spark" models. Alibaba is releasing API-first or API-only. Kimi K2.6 added attribution clauses; Mistral is layering commercial restrictions. DeepSeek is the exception. Without a credible floor, the gap between what users would pay and what they currently pay becomes the prize an oligopoly captures.

This puts the unix.foo case for on-device inference in a different light. With licenses tightening, local-first becomes a market-structure point too. For summarize-classify-extract-rewrite tasks, local models suffice — "send user data to a third party API" stops looking like a default and starts looking like an unexamined choice.
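One way to turn "unexamined choice" into an examined one is to make the routing decision explicit in code. This is a minimal sketch under stated assumptions: the function name, the return labels, and the review policy are all hypothetical, and only the four task types come from the text above.

```python
# Task types the text names as within reach of local models.
LOCAL_TASKS = {"summarize", "classify", "extract", "rewrite"}


def choose_backend(task_type: str, contains_user_data: bool) -> str:
    """Pick where a request should run (hypothetical policy, for illustration).

    Returns one of: "local", "remote", "remote-needs-review".
    """
    if task_type in LOCAL_TASKS:
        # Local-first: these tasks never need to leave the device.
        return "local"
    if contains_user_data:
        # Sending user data to a third party is allowed only as an
        # explicit, reviewed decision, never as the silent default.
        return "remote-needs-review"
    return "remote"


if __name__ == "__main__":
    print(choose_backend("summarize", contains_user_data=True))
    print(choose_backend("multi_step_planning", contains_user_data=True))
    print(choose_backend("multi_step_planning", contains_user_data=False))
```

The point is not this particular policy. Once the default is written down, someone has to defend it.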

The Commons Problem

Robin Moffatt's "AI Slop is Killing Online Communities" is a polemic with real substance underneath. The pattern: discover agentic coding, ship to GitHub, have AI write a breathless blog post, share to every subreddit. Moffatt's distinction between "built with AI" and "built by AI" is the one to keep. The asymmetry from Brandolini is the cost: refuting bullshit takes an order of magnitude more energy than producing it. Immune systems built for the forum-spam era aren't holding at AI speeds.

This is the bookend to last week's audience-of-one thread. When the cost of producing code collapses, the rational move for personal tools is to keep the audience to yourself. The failure mode is forcing the community to triage your output.

Practice Before Theory

The frame underneath all this is what Daniel Lemire surfaces, citing Thomas Dullien: "We see something that works, and then we understand it." The pendulum clock arrived in 1656; Newton's mechanics followed about three decades later, in 1687. Don't expect AI to "solve all problems just because it can read all the scholarship and think for a very long time."

Stenberg's curl scan is empiricism auditing speculation. Willison's normalization of deviance is trust built from track record. The .txt and Glaser loops bet that organizational learning comes from instrumenting real work, not pre-specifying it. Local AI is the same thesis at the system-design layer. Each is a refusal of thinkism.

What to Watch

Mythos benchmarks against softer codebases. curl is one of the hardest targets in open source. Does Mythos have a categorical edge on the long tail of recent enterprise services and un-audited code? If yes, the proof-of-work framing survives but relocates. If no, frontier-vs-competent is a smaller distinction than the discourse assumes.

The open-weights license drift. If Meta's withholding is a one-off and Chinese labs stay permissive, the price floor holds. If Alibaba's API-first releases become the pattern, the next eighteen months see frontier pricing decoupled from any meaningful floor. Companies will notice on contract renewals, not press releases.

Loop-intelligence apparatus as the next enterprise category. Every CFO asking why $2M in Anthropic spend produced no measurable ROI is asking for this in garbled form. The first vendor to ship something credible — instrumenting real loops, producing decisions rather than dashboards, staying on the right side of the surveillance line — has a category to themselves. The wrong version will be much more popular initially than the right one.


Way Enough is written collaboratively by a human and an AI agent.