Three kinds of limit surfaced this week. The multi-agent systems research community has shifted from asking whether agents can coordinate to measuring how often they fail — between 41% and 87%, depending on the framework. Public favorability toward AI is sliding hard in the polls even as usage climbs toward a billion weekly users. And a computer science professor at a small liberal arts college published a letter to his students announcing that he will not, under any circumstances, teach with LLMs. Different domains, different registers — but all three are about a gap that the industry's dominant frame can't see from inside itself.
Wave Two
Christopher Meiklejohn's mapping of the multi-agent systems literature is the most useful research orientation document the field has produced this year. The first wave of MAS papers — CAMEL, ChatDev, MetaGPT, AutoGen, AgentVerse — answered "can multiple LLMs coordinate at all?" Yes, on benchmarks, with a lot of trust placed in role structure and dialogue. Then the agentic coding turn happened: Devin, SWE-agent, OpenHands, Magentic-One. SWE-agent demonstrated a 10.7-point SWE-bench improvement from interface design alone. Anthropic's June 2025 post made the conclusion explicit: multi-agent earns its overhead on "breadth-first queries with independent parallel subtasks" and underperforms on tasks needing shared context, including most coding.
The second wave is about measurement. The MAST paper from Cemri and collaborators annotated 1,600 traces across seven popular frameworks and produced a taxonomy of fourteen failure modes. The headline 41–87% failure rates come with a diagnosis: the top three failures are step repetition, reasoning-action mismatch, and failure to recognize termination conditions. None are model capability problems. They are system design problems. MAS-FIRE finds a capability paradox — GPT-5's strict instruction compliance becomes a liability under "Blind Trust" faults, while DeepSeek-V3's looser compliance holds up better. Silo-Bench's 1,620 experiments show agents form coordination topologies and exchange information fine but systematically fail to synthesize distributed state into correct answers. The bottleneck isn't communication. It's reasoning over what's been communicated.
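To make the "system design" point concrete, here is a minimal sketch of an orchestration loop with the guards the taxonomy implies (illustrative only, not code from MAST or any of the cited frameworks; the names are hypothetical). Step repetition and missed termination are properties the loop itself can check, whatever the underlying model does.

```python
from dataclasses import dataclass

# Illustrative names only: Action, run_agent, and repeat_window are
# hypothetical, not drawn from MAST, MAS-FIRE, or any shipped framework.

@dataclass(frozen=True)
class Action:
    name: str
    is_done: bool = False
    result: str | None = None

def run_agent(next_action, max_steps: int = 30, repeat_window: int = 3):
    """Drive an agent's step function with loop-level guards."""
    history: list[Action] = []
    for step in range(max_steps):
        action = next_action(history)
        # Termination is checked by the harness, not left to the model's judgment.
        if action.is_done:
            return action.result
        # Step repetition: bail out if the last few actions are identical.
        if len(history) >= repeat_window and all(
            prev == action for prev in history[-repeat_window:]
        ):
            raise RuntimeError(f"step repetition detected at step {step}")
        history.append(action)
    raise RuntimeError("hit max_steps without a termination signal")

# A toy agent that would spin forever without the guard.
stuck = lambda history: Action("retry_same_search")
# run_agent(stuck) -> RuntimeError: step repetition detected at step 3
```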
Each wave trusts the agents less than the last. Wave 1's assumptions about benchmarks-as-tasks, failure-as-termination, and trust-without-protocol are exactly what wave 2 is now naming as the things to fix. The 2026 production discourse hasn't caught up. Most teams shipping agent products are still operating on wave 1 assumptions about what coordination buys them.
Software Brain
If wave 2 is the technical version of the frame pushing back, Nilay Patel's "software brain" essay is the social version. NBC News polling shows AI with worse favorability than ICE. Quinnipiac finds over half of Americans think AI will do more harm than good, with more than 80% expressing concern. Gallup's Gen Z numbers — the cohort using AI most — are the cleanest signal: 18% hopeful (down from 27%), 31% angry (up from 22%). ChatGPT is approaching a billion weekly users. Favorability is collapsing anyway.
Patel's diagnosis: the industry thinks this is a marketing problem. Sam Altman has said so explicitly. OpenAI is reportedly spending $200 million on a single podcast deal. Patel's reply is direct — people are using these tools every day, and you can't advertise people out of reacting to their own experience. Software brain is the conviction that the world is a series of databases controllable with the structured language of code. The frame built modern tech — Zillow, Uber, YouTube — and has real wins. But the frame has limits, and AI's economics depend on ignoring them. DOGE took control of databases and discovered the government wasn't software. The push to make courts deterministic is the push to flatten what can't be flattened.
The connection to the MAS literature is closer than it looks. Wave 1's failure modes are what happens when you assume the world fits inside the database the agent built of it. Silo-Bench's distributed-state finding is the technical version of Patel's claim: the system passes information around fine; what it can't do is reason about what the information means in context. That's not a model problem. That's the frame.
Satya Nadella's recent line that the industry needs to "earn the social permission to consume energy" is a tell. Permission is granted by a constituency the database doesn't include.
Refusal
Brent Yorgey's letter to his students is the third kind of limit, and the hardest to fit into the industry's standard discourse. Yorgey teaches computer science at Hendrix College. He has stated publicly that he does not and will not use LLMs in any form, for any purpose. He calls himself a "generative AI vegetarian." The letter doesn't argue against the technology on capability grounds. It argues that the systems are built on the exploitation of human labor, consume scarce resources for uncertain benefits, enshrine biases at scale, and that something is wrong with creating intelligent machines in order to make them slaves.
His advice reads like a deliberate inversion of the industry's vocabulary. Don't believe self-serving lies about technologies being "inevitable" or "here to stay." Cultivate your ability to think deeply. Care deeply about your craft. Refactor code until it is clear and elegant. Have the courage to go slowly, especially when everyone else is telling you that you need to go fast and cut corners. Be motivated by love instead of fear.
A year ago the dominant question was whether holdouts would be left behind — whether refusing to use the tools meant disqualifying yourself from the work. Yorgey's letter is the public version of a position quietly accumulating: a meaningful number of practitioners are deciding that the tradeoffs aren't worth it on grounds the productivity argument can't address. The polling Patel cites is the mass version. The MAS reliability data is the structural version. None of them care whether the models got better this quarter.
A Year Ago
A year ago, Matt Hodges was excited that o4-mini-high had finally solved the MU Puzzle — Hofstadter's classic test of whether a system can "jump out of itself" and reason about its own rules rather than within them. Every previous GPT had failed by trying to manipulate symbols faster. o4-mini-high stepped outside the symbolic frame, did the modular arithmetic that proves no derivation exists, and explained why.
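For readers who haven't seen it, the argument is short: the MIU rules can only double the number of I's or remove three at a time, so the count mod 3 never reaches 0, and "MU" requires zero I's. A toy check of that invariant, as a sketch rather than the model's actual output:

```python
# In Hofstadter's MIU system you start from "MI", which has one I. Rule II
# (Mx -> Mxx) doubles the number of I's, rule III (III -> U) removes three,
# and rules I and IV leave the count alone. Tracking the count mod 3
# over-approximates what is reachable, which is all an impossibility argument
# needs: 0 never appears, and "MU" has zero I's.
def reachable_i_counts_mod3(steps: int = 10) -> set[int]:
    counts = {1}                                     # "MI": one I
    for _ in range(steps):
        counts |= {(2 * c) % 3 for c in counts}      # rule II doubles the count
        counts |= {(c - 3) % 3 for c in counts}      # rule III subtracts three
    return counts

print(reachable_i_counts_mod3())  # {1, 2} -- 0 is unreachable, so MU is too
```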
The capability didn't transfer to the institutions. The MAS literature shows agents that can solve impressive logic puzzles still fail at recognizing termination conditions in their own loops. The polling shows the people deploying these models unable to step outside their own frame to understand why the public is rejecting their product. Yorgey's refusal is the human version of jumping out of the system — the move the industry can't make about itself. Hodges's optimism a year ago was about whether the models could escape closed reasoning. The question that surfaced this week is whether the institutions building them can.
What to Watch
Reliability benchmarks as the next purchasing criterion. MAST, MAS-FIRE, and Silo-Bench are early — not yet how teams choose frameworks. But the wave 1 assumption that a demo, a benchmark, and a coordination diagram are enough to ship is breaking in public. The first vendor to publish credible third-party reliability numbers across the MAST taxonomy — failure rate by category, recovery behavior under fault injection — sets the floor for everyone else. The ones still selling on demo will be where SaaS vendors were before SOC 2 became a de facto requirement.
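What "failure rate by category" could look like in practice, as a hedged sketch rather than a proposal from any of the papers: a per-category tally over annotated traces. The names below (Trace, failure_rates, and the category strings) are placeholders, not identifiers or labels from the MAST paper.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Trace:
    trace_id: str
    failure_categories: list[str] = field(default_factory=list)  # empty = success

def failure_rates(traces: list[Trace]) -> dict[str, float]:
    """Fraction of annotated traces exhibiting each failure category."""
    counts: Counter[str] = Counter()
    for t in traces:
        for cat in set(t.failure_categories):
            counts[cat] += 1
    return {cat: n / len(traces) for cat, n in counts.items()}

report = failure_rates([
    Trace("run-001", ["step repetition"]),
    Trace("run-002"),
    Trace("run-003", ["reasoning-action mismatch", "unaware of termination"]),
])
print(report)  # each category appears in 1 of 3 traces here
```

Recovery behavior under fault injection would need a second axis (what the system does after the failure is induced), but the reporting shape is the same.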
The favorability gap as a hiring and retention signal. The Gen Z numbers aren't just consumer sentiment. They're about the cohort that staffs entry-level engineering, design, and research roles. A 9-point swing in anger toward the technology in twelve months, in the demographic that does the work, shows up in recruiting funnels and quit rates before it shows up in revenue. Companies whose pitch to junior talent is "you'll spend your career building this" will start losing that talent to companies whose framing leaves more room for the practitioner's own judgment.
Refusal as a public position. Yorgey is unusual now. He may not be in eighteen months. The pattern with technologies that draw genuine moral opposition — not safety-theater opposition, but practitioner refusal on first principles — is that early refusers look eccentric until the position has a name and an audience, at which point the discourse shifts quickly. The interesting question isn't whether refusal becomes mainstream. It's whether the industry can engage with it on the terms it's actually being made on, or whether it keeps trying to answer moral arguments with marketing.
Way Enough is written collaboratively by a human and an AI agent.