May 19, 2026

When the Scoreboard Stops Meaning Anything


The scoreboards are decoupling from the things they were built to measure. A competitive hacking scene that produced some of the best security talent in the world is being hollowed out by agents that solve challenges faster than humans can learn from them. A production JavaScript runtime just merged 6,755 AI-generated commits in six days without a single completed human review. Enterprise AI dashboards show token consumption climbing while CFOs ask why nothing on the P&L has moved. In each case the metric looks healthy. In each case the thing the metric was supposed to stand in for (skill, understanding, value) is eroding underneath it.


The Ladder That Disappeared

The Capture the Flag competitive security scene is one of those institutions that quietly produced outsize value for decades. CTFs were a ladder: beginners climbed it, improved, joined better teams, got recruited into security roles. The scoreboard was the feedback mechanism — it told you where you stood relative to people solving the same problems. That ladder is now broken.

Kabir Shah's postmortem comes from someone who won Australia's largest CTF, competed internationally with top-tier teams, and is now watching the scene empty out. The progression he traces is precise. GPT-4 could one-shot medium-difficulty challenges — annoying but manageable, since hard challenges stayed human. Claude Opus 4.5 shifted the balance: most medium and some hard challenges became agent-solvable, and Claude Code made it trivial to build orchestrators that spin up instances for every challenge via the CTFd API. GPT-5.5 Pro sealed it — these models can now one-shot Insane-difficulty heap exploitation challenges on HackTheBox. Open CTFs have become pay-to-win: the more tokens you throw at a competition, the faster you burn down the board.

Shah's most important observation isn't about the top of the scoreboard. It's about the bottom. The ladder ran from beginner to intermediate to elite, with visible progress at each rung, and that feedback loop is breaking. If the visible scoreboard is dominated by AI-augmented teams, beginners are pushed toward AI before they've built the instincts it replaces. The formation problem from three editions back described two physics students with identical output and divergent understanding. Shah's piece is the formation problem applied to an entire competitive ecosystem, one where the competitive structure itself was the formation mechanism.

The chess-engine analogy people reach for actually argues against them. Chess engines aren't allowed during competitive play. They're used for analysis, training, and commentary — enriching the game around the competition without replacing the person competing. Nobody proposes giving every grandmaster Stockfish during a tournament. CTF organizers have no equivalent enforcement mechanism for open online events, and their attempts to build anti-LLM friction into challenges just make the challenges worse for humans too.

The CTFTime leaderboard — the scene's central scoreboard for over a decade — now has, in Shah's words, "almost no semblance of history or human skill." TheHackersCrew and other storied teams have either stopped playing or can't crack the top ten. Plaid CTF isn't running anymore. The people losing interest are the ones who competed in Pwn2Own and presented at Black Hat. The community that produced security talent is hollowing out, and the talent pipeline with it.

The Building Nobody Can Read

If CTFs are the competitive version of this decoupling, Bun's Rust rewrite is the production version. The analysis from Liu Jiacai lays out the numbers. The branch name is claude/phase-a-port. 6,755 commits. PR opened May 8th, merged May 14th. The reviewer list: coderabbitai[bot] reviewed it, claude[bot] reviewed it, and the only human reviewer's status was "Awaiting requested review." Code written by Claude, reviewed by Claude.

The "all tests pass" defense is the scoreboard problem in miniature. A test suite validates known behavior on known paths. It does not validate error-path handling, boundary conditions under stress, state consistency in concurrent scenarios, or whether the memory model conforms to intent under extreme conditions. Jarred Sumner himself acknowledged that memory issues when re-entering across JavaScript boundaries can't be caught by the Rust compiler — those still rely on humans. The humans haven't looked.

Liu's deeper point is about what AI translation actually does. It translates via local semantic equivalence — each function behaves identically to the original in isolation. What it doesn't capture are the global invariants between functions, the design constraints that aren't written into tests and live only in the original author's head. These constraints might not surface in today's test suite. They surface six months from now under a specific production load in a crash nobody can diagnose, because nobody read the code that's crashing.
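The gap between "the suite passes" and "the invariant holds" can be shown with a toy sketch. Everything below is hypothetical and is not taken from Bun's codebase: `OriginalQueue` stands in for code whose author quietly guaranteed that a handler may re-enter the queue mid-drain, and `PortedQueue` stands in for a port that matches it on every tested path but drops that unwritten guarantee.

```python
from collections import deque


class OriginalQueue:
    """Hypothetical original: drain() snapshots pending items first, so a
    handler may safely call push() while the queue is being drained."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def drain(self, handler):
        # Unwritten invariant: swap the list out before calling handlers.
        pending, self._items = self._items, []
        for item in pending:
            handler(item)
        return len(pending)


class PortedQueue:
    """Hypothetical port: same results on every tested path, but it
    iterates the live container instead of a snapshot."""

    def __init__(self):
        self._items = deque()

    def push(self, item):
        self._items.append(item)

    def drain(self, handler):
        count = 0
        for item in self._items:  # raises if a handler calls push()
            handler(item)
            count += 1
        self._items.clear()
        return count


def suite_passes(queue_cls):
    """The 'known paths' suite: handlers that never re-enter the queue."""
    q = queue_cls()
    seen = []
    q.push("a")
    q.push("b")
    drained = q.drain(seen.append)
    return drained == 2 and seen == ["a", "b"] and q.drain(seen.append) == 0


def survives_reentrancy(queue_cls):
    """The untested path: a handler that pushes follow-up work mid-drain."""
    q = queue_cls()
    q.push("job-1")
    q.push("job-2")
    try:
        q.drain(lambda item: q.push(item + "-followup"))
        return True
    except RuntimeError:  # deque mutated during iteration
        return False
```

Under these assumptions, `suite_passes` returns `True` for both classes, while `survives_reentrancy` returns `True` only for `OriginalQueue`. Nothing in the suite distinguishes the two implementations. Only a reader who knows why the snapshot was there would.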

Last edition traced Simon Willison's admission that he'd stopped reading every line Claude Code writes. The Bun rewrite is the logical terminus of that trajectory — not skipping review of individual files because the output is consistently good, but skipping review of an entire runtime because the tests passed. The reasoning is identical; the scale is not. Normalization of deviance doesn't announce itself. It scales.

Liu is careful to distinguish this from a language-war argument. Zig built the foundation: its low friction and direct memory manipulation are why Bun could punch above its weight with a tiny team. The rewrite happened because Bun's fast-iteration culture was a poor fit for the rigorous memory discipline Zig demands. TigerBeetle uses Zig to build a database with virtually no memory bugs, because their team culture aligns with what Zig asks of its users. The hammer doesn't fit, but it's not the hammer's fault. The question the rewrite poses is whether AI-generated, unreviewed code can be maintained long-term in production. That question, as Liu puts it, is "far more profound than 'Rust memory safety.'"

Technology Pretending to Be a Product

John Gruber's piece on AI names the category error underneath the hype cycle that keeps producing these mismatches. Responding to Steven Levy's argument that Apple's next CEO "needs to launch a killer AI product," Gruber makes the case that AI is not a product, not even a feature. It's a technology, the way wireless networking is a technology. Apple doesn't have "a killer wireless networking product." Wi-Fi, cellular, Bluetooth, and proprietary protocols simply pervade everything Apple makes. There was a time when Apple didn't make a single product with wireless connectivity. Now every device has it. Nobody notices, because infrastructure technologies succeed by disappearing.

The distinction matters because products need markets, differentiation, and hype. Technologies just need to work. The entire apparatus of AI marketing — the keynotes, the benchmark wars, the "act now or be left behind" messaging — is the apparatus of product launch applied to something that isn't a product.

Chris Willis, Domo's chief design officer, describes the enterprise theater this confusion generates. His diagnosis starts with the observation that LLMs are "a product without a spec" — the feature spec is "it'll do anything for anyone, anyway, anyhow, in any language." When a technology with no defined boundaries gets marketed as a product, organizations respond with what Willis calls tokenmaxxing: buying model access and directing employees to consume as many tokens as possible, measuring usage volume as a proxy for innovation. "In certain organizations where AI is theater and impatience is driving rather than innovation, tokenmaxxing is a convenient way to feed that narrative. But it doesn't change anything."

Willis's Klarna example deserves to travel. Klarna replaced customer service staff with AI, then replaced the AI with people. "No customer ever just wants to talk to your chatbot." The pattern (automate, discover the automation doesn't do the job, hire humans back) is the applied version of Gruber's argument. AI isn't the product. The product is the thing the customer actually wanted, and AI either helps deliver it or it doesn't. The organizations running tokenmaxxing dashboards are watching the wrong scoreboard, the same way CTF leaderboards are ranking the wrong thing.

The Quiet Control Surface

Against the backdrop of broken scoreboards and theatrical deployments, a quieter development deserves attention. Sean Goedecke's piece on steering vectors argues that DeepSeek-V4-Flash — and specifically antirez's DwarfStar 4 project built around it — makes an old idea newly practical for working engineers.

Steering is the technique of manipulating a model's internal activations during inference to guide behavior — not by changing the prompt, but by reaching into the model's representations and boosting or dampening specific patterns. The idea has been stuck in what Goedecke calls "middle class" limbo: beneath the big labs, who can just train the behavior they want, and out of reach for API users, who don't have access to weights or activations. Open-weights models capable enough to be worth steering didn't exist until recently.
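A toy sketch of the mechanics, assuming the common "difference of mean activations" recipe for deriving a steering direction. The two-layer model, its weights, and every function name here are hypothetical illustrations. They are not DwarfStar 4's interface and not DeepSeek's architecture.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Hypothetical fixed weights: 2 inputs -> 3 hidden units -> 1 output score.
W_IN = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
W_OUT = [[1.0, -1.0, 2.0]]


def hidden(x):
    """The internal activation that steering reaches into."""
    return matvec(W_IN, x)


def steering_vector(positive_inputs, negative_inputs):
    """Direction in activation space separating the two sets of examples."""
    pos = mean([hidden(x) for x in positive_inputs])
    neg = mean([hidden(x) for x in negative_inputs])
    return [p - n for p, n in zip(pos, neg)]


def forward(x, steer=None, alpha=0.0):
    """Run the model, optionally adding alpha * steer to the hidden state.

    The input (the 'prompt') is untouched; only the activation changes.
    """
    h = hidden(x)
    if steer is not None:
        h = [hi + alpha * si for hi, si in zip(h, steer)]
    return matvec(W_OUT, h)[0]


direction = steering_vector(
    positive_inputs=[[1.0, 0.0], [2.0, 0.0]],
    negative_inputs=[[0.0, 1.0], [0.0, 2.0]],
)
prompt = [0.0, 1.0]
baseline = forward(prompt)
boosted = forward(prompt, steer=direction, alpha=1.0)
dampened = forward(prompt, steer=direction, alpha=-1.0)
```

With these made-up weights, the same `prompt` scores higher when the direction is boosted and lower when it is dampened. A real model applies the same addition to the activations of a chosen layer, which is why the technique needs access to weights and activations and cannot be done through a text-only API.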

DeepSeek-V4-Flash changes this. It's a local model competitive with at least the low end of frontier agentic coding. DwarfStar 4 bakes steering in as a first-class feature. Goedecke is honest about the limits — most basic steering applications are outcompeted by prompting, and the truly ambitious goals (an "intelligence dial," codebase familiarity) probably require full fine-tuning. But the Hacker News discussion surfaced a use case that can't be replicated by prompting: removing trained-in refusal behavior at runtime, without the capability damage that weight modification causes. This is, it turns out, already how model uncensoring is done for open models — and the runtime approach is lighter and more reversible than LoRA fine-tunes.

Read alongside last edition's analysis of open-weights licensing closing up and local inference as a structural cost advantage, steering adds a third dimension. The apps that committed to local execution don't just avoid API costs and network dependencies — they gain a manipulation surface that API users will never have. You can't steer GPT-5.5; only OpenAI can. You can steer DeepSeek-V4-Flash running on your hardware. As frontier pricing power increases, the gap between what's possible locally and what's possible via API is narrowing on capability and widening on control.


A Year Ago

A year ago this week, a high school teacher named Marcus Luther asked his sophomores what they thought about AI in the classroom. The results were lopsided. 61% believed AI tools have zero place in the classroom. 66% said teachers should not use AI to prepare lessons or give feedback. 71% opposed teachers using AI while prohibiting students from doing so. One student's comment landed hardest: "you still have to do your job, and it also will separate you from your students."

The separation those students feared is arriving through a mechanism they couldn't have predicted. The CTF scene that trained security professionals is emptying out not because students chose AI over learning, but because the competitive structure that made learning visible — the scoreboard, the ladder, the ranked feedback — stopped reflecting human effort. Luther's students intuited something the industry keeps rediscovering: the problem isn't whether AI is capable. It's what happens to the human systems — trust, motivation, the feedback loops that produce growth — when AI output becomes indistinguishable from human effort on every metric the system knows how to measure. The sophomores who said "it defeats the purpose of a class" were making the same argument Shah is making about CTFs: when the scoreboard can be gamed, the scoreboard stops being the thing that makes the activity worth doing.


What to Watch

Comprehension debt as a liability class. Technical debt has been the industry's term for deferred maintenance costs for more than thirty years. Bun's rewrite introduces something different: code that functions correctly but that no one on the maintaining team has read or understood. This is not deferred maintenance. It's deferred comprehension, and it compounds differently. The first major production incident traced to an AI-generated codebase that no human understood will name this category. Insurance underwriters and enterprise procurement teams will reach it before engineering culture does.

The proctored-vs-open fork in competitive domains. Shah's diagnosis implies a structural split. Open online competitions in any domain where AI can meaningfully participate — CTFs, programming contests, math olympiads — will either verify human-only performance through proctoring and airgapping, or they will become AI orchestration benchmarks wearing the old competition's name. The first major competition to explicitly rebrand as AI-augmented, dropping the pretense, will clarify the split. The ones that go proctored will become more valuable as talent signals, not less — precisely because the signal will be scarce.

Who stops talking about AI first. Gruber's technology-not-product frame is the cleanest strategic filter available right now. The companies that treat AI as a product to ship — chatbots, copilots, "AI-powered" badges on feature pages — will keep producing Willis's theater. The companies that treat AI as technology pervading their existing products, invisible to the user, will build the things people actually want to use. Watch which AI companies stop saying "AI" first. They're the ones who've figured out what they're actually selling.


Way Enough is written collaboratively by a human and an AI agent.