The through-line this week is convergence at a frequency that feels like consensus: the AI economy is generating spectacular proof of capability and almost no auditable proof of durable revenue. From indie agent builders stacking Mac Minis to OpenAI's 900-million-user engagement problem to Anthropic's distillation accusations, every signal points the same direction — the cost of building v1 has collapsed, but the cost of everything that comes after hasn't moved. We're watching an industry discover, in real time, that the hard part was never the technology.
The Agent Gold Rush Is Long on Screenshots, Short on Receipts
The AI agent hype cycle has reached a recognizable phase: the part where the loudest success stories are indistinguishable from marketing. Search volume for "AI agents passive income" is spiking. Tech Twitter is saturated with Mac Mini stacks and OpenClaw dashboards radiating implied wealth. An aesthetic economy has matured around agents — screenshots as product, repos as social proof, dashboard glow as revenue proxy.
The documented cases of durable, agent-driven income remain conspicuously thin. The auto-trading narrative — your agent detects market inefficiencies while you sleep — founders on a basic asymmetry: if the inefficiency is obvious enough for a Mac Mini to find, a quant fund with real infrastructure found it last Tuesday. Widely shared strategies don't stay profitable. They become transfer-of-wealth mechanisms, usually not in the sharer's favor.
Where agents are generating real revenue is exactly where you'd expect and exactly where no one goes viral: back-office automation, compliance documentation, lead qualification, reconciliation workflows. Boring, vertical, sticky. Companies pay for friction removal. They don't pay for aesthetic proximity to alpha.
This is the vibes-vs-value gap hardening into something structural. The pattern hasn't shifted — it's just gotten sharper. The people consistently making money from the agent boom are selling infrastructure, orchestration tooling, and vertical automation. The shovels are fine. The livestreamed gold rush remains unpredictable.
Coding Agents Multiply What You Already Know
Drew Breunig's framing of coding agents cuts through the noise with a pair of observations that should be required reading.
First: skilled developers dramatically underestimate the intuitive knowledge they bring to their prompts. When an experienced engineer says Claude Code "just worked," what you're not seeing is the prompt — relatively specific, using the right terms, activating entirely different model weights than someone typing "the search is broken fix it." The luminaries who say they haven't written code in weeks aren't lying. They're just not seeing how much domain expertise they inject into every interaction. Expertise is invisible to the person who has it.
Second: most of what gets hyped as agentic coding output is personal software, not products. Breunig built an RSS-feed-finding browser extension by forking an existing project and pointing Claude Code at it. He uses it daily. He will not support it, push it to app stores, test it across browsers, or build it for users who don't share his specific setup. "Code is free, as in puppies." The v1 is trivial. Everything that makes it a product — testing, support, cross-platform, marketing — remains expensive.
This is the first-version trap at the individual scale, and it connects directly to the receipts problem above. AI has collapsed the cost of manifestation without proportionally reducing the cost of maintenance, support, or market fit. The result is an explosion of personal tools that look like products but aren't — impressive, useful, and fundamentally non-transferable. The people shipping agent-written code into actual products are testing, reviewing, supporting, and doing all the unglamorous work that "coding is solved" rhetoric conveniently omits.
Opus 4.6 Is Impressive. The Structural Question Is Whether That Matters.
Anthropic's Claude Opus 4.6 landed, and by most accounts it represents a genuine capability jump — particularly in agentic reasoning and sustained multi-step task completion. This isn't disputed.
What matters more is the context it lands in. Benedict Evans's strategic analysis of OpenAI doubles as a diagnosis of the entire foundation model layer: roughly half a dozen organizations are shipping competitive frontier models, leapfrogging each other every few weeks, and there is no known mechanic for any of them to build a durable lead. No network effects. No self-reinforcing market share. No equivalent of what Windows, iOS, or Google Search had.
Evans draws the browser analogy, and it's the sharpest articulation of the commoditization problem to date. A chatbot, like a browser, is an input box and an output box. The last real product innovations in browsers were tabs and merging search with the URL bar. Microsoft won browsers for the first generation of the consumer internet, and it turned out not to matter — the value accrued elsewhere, to the experiences built on top. The same structural pressure is now bearing down on foundation models.
For practitioners, the implication is blunt: build for the capability tier, not the specific model. Whatever you design around Opus 4.6's strengths today will need to survive the field catching up by summer.
OpenAI's Flywheel Is Actually a Treadmill
Evans's full strategic audit deserves its own section because the diagnosis extends beyond one company to the shape of the market.
OpenAI has 800–900 million weekly active users, undifferentiated technology, and shallow engagement. Eighty percent of ChatGPT users sent fewer than 1,000 messages in all of 2025 — an average of less than three prompts a day, and many fewer distinct sessions. Only 5% pay. OpenAI itself calls this a "capability gap" between what models can do and what people do with them — which, as Evans notes, is a diplomatic way of avoiding the words "product-market fit."
If people who know what this is and know how to use it still can't think of something to do with it on an average day, a better model may not be what's missing. It may require entirely new experiences that haven't been invented yet. And if that's the case, there's no structural reason the model provider will be the one who invents them.
The CFO's flywheel diagram — capex drives users, users drive revenue, revenue drives capex — describes a treadmill, not a cycle of increasing returns. A flywheel implies each revolution gets easier. Nothing in OpenAI's metrics suggests that. Meanwhile, Google and Meta are gaining market share through distribution advantages with products that look essentially identical to the typical user. Anthropic tops benchmarks but has near-zero consumer awareness. The competition is shifting to brand and distribution, which is exactly what happens when the underlying product resists differentiation.
Sam Altman's response — everything, all at once, yesterday — reads as acute awareness of the problem. Trading paper for durable strategic position before the window closes. Whether that constitutes strategy or just activity remains to be seen.
Distillation: The Tension That Doesn't Resolve
Anthropic publicly accused DeepSeek, Moonshot AI, and MiniMax of industrial-scale distillation — over 16 million exchanges through roughly 24,000 fraudulent accounts. The headline sounds alarming. The technical reality is more nuanced than the political framing suggests.
DeepSeek's 150,000 exchanges are negligible at training scale — a small team running experiments, likely unknown to the broader organization. The volume came from MiniMax (13 million exchanges) and Moonshot (3.4 million), both focused on agentic reasoning and tool use — precisely the capability where Claude currently leads. At an estimated 150–400 billion tokens across the two larger campaigns, this could meaningfully improve post-training. But quantity is a crude measure of impact. Getting outputs from a teacher model to actually improve a student model is a genuine research problem, not a copy-paste operation.
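The "not a copy-paste operation" point has a concrete shape. Classic distillation in the Hinton soft-label sense trains a student to match the teacher's full probability distribution over the vocabulary at every position, but an API never exposes those distributions: scraped exchanges contain only sampled tokens, forcing cruder sequence-level imitation on the teacher's text as hard labels. A toy numpy sketch of the distribution-matching loss that API scrapers don't get to use (the tensors, vocabulary size, and temperature are all illustrative, not drawn from any real model):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label KD loss: KL(teacher || student) over
    temperature-softened distributions, averaged per token.
    The T*T factor is the usual gradient-scale correction."""
    p = softmax(teacher_logits / T)   # teacher soft targets
    q = softmax(student_logits / T)   # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean() * T * T)

# Toy setup: a 5-token vocabulary over 3 sequence positions.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(3, 5))
aligned = teacher + rng.normal(scale=0.1, size=(3, 5))  # student tracking the teacher
random_ = rng.normal(size=(3, 5))                       # unrelated student

print(distillation_loss(aligned, teacher))   # small: distributions nearly match
print(distillation_loss(random_, teacher))   # larger: unrelated student
```

Because an API returns only sampled text, none of this distribution-level signal is available to a scraper, which is one reason raw exchange counts are a crude proxy for how much capability actually transfers.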
The structural point is more important than the accounting. Distillation is a shortcut to compute, not a replacement for it. As reinforcement learning becomes central to frontier training, the value of distilled outputs diminishes — RL requires on-policy inference from your own model, and that's where most of the compute cost lives. The RL era makes pure distillation a shrinking advantage even as the political rhetoric around it intensifies.
The tension Anthropic can't resolve is fundamental: you cannot simultaneously offer the world's best agentic model as an API product and object when that access trains competitors. If the models are so precious that distillation is existential, the logical endpoint is restricting them to first-party products — which no lab will do because the API revenue is essential. But the political ratchet only turns one direction. Expect other labs to follow with their own disclosure campaigns. The technical merits will matter less than the utility of these narratives as instruments in AI trade policy.
The AI Layoff Script Hardens Into Convention
Jack Dorsey's Block memo — announcing job cuts framed as AI-driven restructuring — continues a pattern that's been building for weeks. Block over-hired, is correcting, and is using AI as the narrative that makes a management failure sound like a strategic pivot. The suggestion that the highest-leverage AI-for-human swap at Block would be replacing Dorsey himself lands as dark comedy, but the serious point stands: "we're replacing headcount with AI" is becoming boilerplate. The enterprise version of the Mac Mini stack. A signal designed to imply a story the evidence doesn't support.
AI as corporate narrative device is following the same trajectory as "digital transformation" a decade ago. The informational value of the phrase in a corporate memo is now approximately zero.
What to Watch
The post-demo transition enters its test phase. Every signal this week reinforces that the field is moving from "can we build it?" to "does anyone durably pay for it?" The next quarter will be defined not by capability announcements but by retention data, revenue durability, and whether OpenAI's inch-deep engagement has a product solution or a structural one. If the engagement numbers don't deepen meaningfully, the browser analogy becomes prophecy.
Vertical automation as quiet acquisition target. The real agent revenue is in workflow automation that never trends — compliance, reconciliation, operational stitching. Watch for acquisition activity from larger players who've noticed that the boring stuff is where the money actually lands.
Foundation model pricing hits the floor. With half a dozen competitive frontier models and no differentiation mechanic, commodity dynamics are accelerating. The question isn't whether prices drop but how fast, and what that does to labs whose entire business model depends on API margins. The race to the bottom has a floor somewhere. We haven't found it yet.
Distillation as standard trade-policy instrument. Anthropic's disclosure sets a template. The gap between the technical reality (diminishing returns in an RL-dominated training paradigm) and the political utility (justifying restrictions under a national-security banner) is where the actual policy will get made.
Way Enough is written collaboratively by a human and an AI agent.