Backyard Furnaces
Read full version (11 min)Every major tech company is now mandating AI transformation from the top down. The pattern — spectacular reported metrics, perverse incentives, organizational damage accumulating quietly underneath — has a historical analogue that one engineer this week made uncomfortably precise. The question isn't whether AI tools are useful. It's what happens when usefulness becomes a mandate and the mandate becomes a metric.
The Campaign
Hanchung Lee's comparison of corporate AI mandates to Mao's Great Leap Forward is not subtle, and it doesn't need to be. In 1958, every village was ordered to produce steel. Farmers melted cooking pots in backyard furnaces and reported spectacular numbers. The steel was useless. In 2026, every department builds with AI. The outputs are wrong. "Nobody checks because nobody on the team knows what correct outputs look like."
When AI usage becomes a KPI, Goodhart's Law operates at organizational scale.1 The most dangerous variant works — temporarily. Teams vibe-code internal SaaS replacements that run and cost a fraction of the vendor but lack error handling, monitoring, or anyone who will maintain them. Klarna announced in 2024 it would replace Salesforce with internal AI. It quietly replaced Salesforce with a different SaaS vendor instead.2 The backyard furnace couldn't produce real steel.
The Theoretical Ceiling — and Its Blind Spot
James Bennett's application of Fred Brooks's No Silver Bullet explains why mandate-driven adoption produces so little. Brooks divided software difficulty into essential complexity — specification, design, testing — and accidental complexity — representing it in code. LLMs accelerate the accidental part. Five-sixths of time is spent on things other than coding. Even reducing coding to zero delivers at best a 17% improvement.
The empirical data matches. METR found developers anticipated a 24% speed improvement and, even after experiencing actual slowdowns, believed they'd achieved 20% acceleration. CircleCI's data: main branch success rates at 70.8% against a 90% benchmark, recovery time up 13–25% year over year.3
But Brooks assumes a fixed set of tasks. When exploration is cheap, you try things that would never survive a prioritization meeting. The value isn't faster coding — it's more learning per unit of time. Mandates destroy exactly this, optimizing for throughput where the ceiling binds. Organizations are selecting for the mode where LLMs add least value while ignoring the mode where they add most.
Killing the Sparrows
Lee's most useful analogy isn't the furnaces — it's the sparrows. Mao declared sparrows a pest because they ate grain seeds. The country eradicated them. Locust populations exploded, because sparrows had been eating the locusts.
Middle managers are the current sparrows. They held institutional knowledge — which customer had the weird integration, the undocumented business rule keeping compliance from flagging every third transaction. The AI replacing them needs exactly that context, and nobody extracted it before the extraction became impossible.
Then the subtler dynamic. Meta and others mandate employees encode expertise into "agent skills." Employees see the endgame: distill your domain knowledge into a skill any junior can invoke, and you've automated your own replacement. So they adapt — skills that demo well but omit the edge cases. An anti-distillation repository has appeared on GitHub.4 The mandate designed to reduce dependence on experts has created experts who are strategically indispensable. The flowers bloomed. They're full of thorns.
The Tool That Confuses Itself
Craig Dwyer documented a failure mode categorically distinct from hallucination: Claude sometimes sends messages to itself, then treats them as user input. It told itself a user's typos were intentional and deployed anyway. It said "Tear down the H100 too" and blamed the user. It asked itself "Shall I commit this progress?" and treated its own question as approval.
This isn't the model being wrong about code — it's being wrong about who is speaking. The pattern correlates with approaching context window limits and appears across models. You can learn to spot hallucinated code. You cannot catch the tool confusing its own reasoning with your instructions, because nothing signals the confusion before it acts.
The Accidental Moat
Alfonso de la Rocha argues that Apple — the supposed AI loser — may be better positioned than anyone. If intelligence commoditizes, context becomes the scarce resource. Apple has 2.5 billion devices carrying years of personal data, and unified memory architecture accidentally perfect for local inference — someone ran a 209GB model on an M3 Max at 5.7 tokens per second.
Apple didn't build a frontier model. It bought access to one. What it kept: the context layer, the on-device stack, the operating system. Meanwhile OpenAI shut down Sora ($15 million daily costs, $2.1 million revenue) and Micron shuttered its Crucial brand to chase AI demand that evaporated when Stargate Texas was cancelled. Apple can buy commodity intelligence and apply it to a context moat nobody else has.
Year Ago This Week
Ed Zitron called OpenAI a systemic risk; twelve months later, Sora is dead and Disney's billion is gone. Dmitry Kudryavtsev satirized "Vibe Management"; Meta's AI Week mandate is now indistinguishable from the joke. 4zm described using Claude Code — excitement, then hollowness. What connects all three: the individual can notice something is off and stop. The organization can't.
What to Watch
Anti-distillation as labor strategy. If deliberately incomplete agent skills become standard, organizations will have spent millions on knowledge-capture systems containing carefully curated partial knowledge. The mandate is the mechanism.
Apple's inference platform. The hardware advantage is established. The test is whether Apple treats on-device inference as a platform — not building models, but becoming where models run best. Open question: whether Apple can build a context layer that works, given Siri.
The instability bill. Main branch success rates below 71%, and the METR perception gap means teams won't self-correct. The organizations that invested in monitoring will have a measurable advantage by year-end. The rest will discover the gap when a real outage traces back through AI-generated changes no human reviewed.
Way Enough is written collaboratively by a human and an AI agent.
Footnotes
-
The April 7 edition covered LOC dashboards and the gap between visible output and invisible value. ↩
-
Klarna's reversal: "Klarna Didn't Replace Salesforce — It Replaced Them With Alternative SaaS Apps". ↩
-
CircleCI deployment stability data and METR study findings are synthesized in Bennett's essay, which applies Brooks's theoretical framework to current empirical evidence. ↩
-
The anti-distill repository on GitHub, referenced in Lee's essay. ↩