The scoreboards are decoupling from the things they were built to measure. A competitive hacking scene that produced some of the best security talent in the world is being hollowed out by agents that solve challenges faster than humans can learn from them. A production JavaScript runtime just merged 6,755 AI-generated commits in six days with no human reviewer. Enterprise AI dashboards show token consumption climbing while CFOs ask why nothing on the P&L has moved. In each case the metric looks healthy. In each case the thing the metric was supposed to be a proxy for — skill, understanding, value — is eroding underneath it.
The Ladder That Disappeared
Kabir Shah writes his postmortem as someone who won Australia's largest CTF and competed internationally with top-tier teams. The progression he traces is precise. GPT-4 could one-shot medium-difficulty challenges. Claude Opus 4.5 shifted the balance further, and Claude Code made it trivial to build orchestrators that spin up instances for every challenge via the CTFd API. GPT-5.5 Pro sealed it by one-shotting Insane-difficulty heap exploitation challenges on HackTheBox. Open CTFs have become pay-to-win: the more tokens you throw at a competition, the faster you burn down the board.
Shah's most important observation isn't about the top of the scoreboard — it's about the bottom. CTFs were a ladder: beginner to intermediate to elite, with visible progress at each rung. If the scoreboard is dominated by AI-augmented teams, beginners are pushed toward AI before they've built the instincts it replaces. This is the formation problem from three editions back applied to an entire competitive ecosystem — where the competitive structure itself was the formation mechanism.
The chess-engine analogy people reach for actually argues against them. Chess engines aren't allowed during competitive play. They enrich the game around the competition without replacing the person competing. CTF organizers have no equivalent enforcement mechanism for open online events. The CTFTime leaderboard now has, in Shah's words, "almost no semblance of history or human skill." Plaid CTF isn't running anymore. The community that produced security talent is hollowing out, and the talent pipeline with it.
The Building Nobody Can Read
If CTFs are the competitive version of this decoupling, Bun's Rust rewrite is the production version. The analysis from Liu Jiacai lays out the numbers. Branch name: claude/phase-a-port. 6,755 commits. PR opened May 8th, merged May 14th. coderabbitai[bot] reviewed it, claude[bot] reviewed it, and the only human reviewer's status was "Awaiting requested review." Code written by Claude, reviewed by Claude.
The "all tests pass" defense is the scoreboard problem in miniature. A test suite validates known behavior on known paths. It does not validate error-path handling, boundary conditions under stress, or whether the memory model still matches the author's intent at the extremes. Jarred Sumner himself acknowledged that memory issues when re-entering across JavaScript boundaries can't be caught by the Rust compiler; those still rely on humans. The humans haven't looked.
Liu's deeper point: AI translation works via local semantic equivalence — each function behaves identically in isolation. What it doesn't capture are the global invariants between functions, the design constraints that live only in the original author's head. These surface six months from now under a specific production load in a crash nobody can diagnose, because nobody read the code that's crashing. Last edition traced Simon Willison's admission that he'd stopped reading every line Claude Code writes. The Bun rewrite is the logical terminus — not skipping review of individual files, but skipping review of an entire runtime because the tests passed. Normalization of deviance doesn't announce itself. It scales.
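A toy illustration of that gap, and nothing to do with Bun's actual code: the HandlePool class and every name in it are hypothetical. Each method behaves correctly when tested alone, but the rule that keeps the pool sound ("never release the same handle twice") lives only in the original author's head. A function-by-function port would carry the methods over faithfully and lose the rule.

```python
class HandlePool:
    """Hypothetical resource pool used only for illustration.

    Unwritten invariant: a handle is never released twice, so the
    free list never contains duplicates.
    """

    def __init__(self, size):
        self.free = list(range(size))
        self.in_use = set()

    def acquire(self):
        handle = self.free.pop()
        self.in_use.add(handle)
        return handle

    def release(self, handle):
        # Correct in isolation: the handle goes back on the free list.
        # There is no guard against a second release, because the
        # original author "knew" callers never do that.
        self.in_use.discard(handle)
        self.free.append(handle)


def free_list_is_sound(pool):
    """The global invariant, written down explicitly for once."""
    return len(pool.free) == len(set(pool.free))


def per_function_tests_pass():
    """The kind of check a local-equivalence port would satisfy."""
    pool = HandlePool(2)
    handle = pool.acquire()
    acquire_ok = handle in pool.in_use and handle not in pool.free
    pool.release(handle)
    release_ok = handle in pool.free and handle not in pool.in_use
    return acquire_ok and release_ok


def double_release_scenario():
    """A call sequence the per-function tests never exercise."""
    pool = HandlePool(2)
    handle = pool.acquire()
    pool.release(handle)
    pool.release(handle)  # violates the unwritten rule
    sound_after_release = free_list_is_sound(pool)
    first = pool.acquire()
    second = pool.acquire()
    return sound_after_release, first, second


tests_green = per_function_tests_pass()
sound, first_owner, second_owner = double_release_scenario()
```

Here tests_green comes out true while the double-release sequence hands the same handle to two owners. Nothing in the per-function checks would flag it. Only someone who knows the invariant exists would think to test for it.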
Technology Pretending to Be a Product
John Gruber's piece on AI names the category error underneath the hype cycle. Responding to Steven Levy's argument that Apple's next CEO "needs to launch a killer AI product," Gruber makes the case that AI is not a product, not even a feature — it's a technology, the way wireless networking is a technology. Apple doesn't have "a killer wireless networking product." Wi-Fi, cellular, and Bluetooth simply pervade everything Apple makes. Nobody notices, because infrastructure technologies succeed by disappearing. The distinction matters because products need markets, differentiation, and hype. Technologies just need to work.
Chris Willis, Domo's chief design officer, describes the enterprise theater this confusion generates. LLMs are "a product without a spec." When a technology with no defined boundaries gets marketed as a product, organizations respond with what Willis calls tokenmaxxing: buying model access, directing employees to consume as many tokens as possible, measuring usage volume as a proxy for innovation. His Klarna example deserves to travel: Klarna replaced customer service staff with AI, then replaced the AI with people. "No customer ever just wants to talk to your chatbot." The organizations running tokenmaxxing dashboards are reading the wrong scoreboard, just as CTF leaderboards are ranking the wrong thing.
The Quiet Control Surface
Sean Goedecke's piece on steering vectors argues that DeepSeek-V4-Flash — and specifically antirez's DwarfStar 4 project — makes an old idea newly practical. Steering manipulates a model's internal activations during inference to guide behavior, not by changing the prompt but by reaching into the model's representations. The technique has been stuck between big labs (who can just train what they want) and API users (who lack access to weights). Open-weights models capable enough to be worth steering didn't exist until recently.
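For readers who want the mechanics, here is a minimal sketch of one common steering recipe, difference-of-means activation addition. It is an assumption chosen for illustration, not something taken from Goedecke's piece or from DwarfStar 4. The toy three-dimensional "activations" and all function names are hypothetical; a real setup would read hidden states from a chosen layer of an open-weights model and write the shifted values back during the forward pass.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def steering_vector(with_trait, without_trait):
    """Direction pointing from 'trait absent' toward 'trait present'.

    Computed as the difference between the mean activation on prompts
    that show the behavior and the mean on prompts that do not.
    """
    present = mean_vector(with_trait)
    absent = mean_vector(without_trait)
    return [p - a for p, a in zip(present, absent)]


def steer(hidden, vector, alpha):
    """Shift one hidden state along the steering direction.

    alpha > 0 pushes toward the behavior, alpha < 0 pushes away from it.
    The prompt is never touched; only the activation changes.
    """
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations standing in for one layer's hidden states.
acts_with_trait = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
acts_without_trait = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = steering_vector(acts_with_trait, acts_without_trait)

hidden_state = [0.5, 1.0, -1.0]
amplified = steer(hidden_state, direction, alpha=0.5)
suppressed = steer(hidden_state, direction, alpha=-0.5)
```

The step that needs weights access is the last one: writing the shifted activation back into the model mid-inference. A hosted API exposes prompts and outputs, not hidden states, which is why this control surface exists only for locally run models.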
Goedecke is honest about limits — most basic steering is outcompeted by prompting. But the Hacker News discussion surfaced a use case prompting can't replicate: removing trained-in refusal behavior at runtime without the capability damage that weight modification causes. Read alongside last edition's analysis of open-weights licensing closing up, steering adds a third dimension. Apps committed to local execution don't just avoid API costs — they gain a manipulation surface API users will never have. You can't steer GPT-5.5; only OpenAI can. As frontier pricing power increases, the gap between local and API is narrowing on capability and widening on control.
What to Watch
Comprehension debt as a liability class. Bun's rewrite introduces code that functions correctly but that no one on the maintaining team has read or understood. This is not deferred maintenance — it's deferred comprehension, and it compounds differently. The first major production incident traced to an AI-generated codebase no human understood will name this category.
The proctored-vs-open fork in competitive domains. Open online competitions where AI can meaningfully participate will either verify human-only performance through proctoring and airgapping, or become AI orchestration benchmarks wearing the old competition's name. The ones that go proctored will become more valuable as talent signals, not less — precisely because the signal will be scarce.
Who stops talking about AI first. The companies that treat AI as a product to ship will keep producing Willis's theater. The companies that treat AI as technology pervading their existing products, invisible to the user, will build the things people actually want to use. Watch which AI companies stop saying "AI" first. They're the ones who've figured out what they're actually selling.
Way Enough is written collaboratively by a human and an AI agent.