A three-person team ships production software no human writes or reviews. Down the hall, experienced developers get measurably slower with AI — and never notice. The distance between them is the most important gap in software, and no tool can close it.
Here are two facts that are both true right now, in the same industry, on the same planet. At Anthropic, roughly 90% of Claude Code's own codebase is now written by Claude Code, and the engineer who built it says he hasn't edited a line by hand in months. And in a rigorous randomized controlled trial, experienced open-source developers using AI tools took 19% longer to finish their tasks — while believing they'd gone 20% faster. The frontier is sprinting. The middle is slowing down and congratulating itself. The gap between those two realities is what this piece is about — and, more usefully, what it takes to cross it.
Start with the measured middle, because it's the part the hype skips. The trial was run by METR, and it was not a survey but a randomized controlled trial, the same method used for drug trials. Sixteen experienced developers completed 246 real tasks in mature repositories they had worked on for an average of five years; each task was randomly assigned to allow or forbid AI tooling. Going in, they forecast AI would cut their time by 24%; afterward they estimated it had saved 20%; in reality, allowing AI increased completion time by 19%. The gap between perception and measurement wasn't a rounding error. It was the wrong sign entirely.
Figure 1 — The paradox
Same tools, opposite outcomes
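The arithmetic behind that wrong sign is small enough to write down. This is a minimal sketch using only the three percentages quoted above; the variable names and the signed-change convention (negative means faster) are illustrative assumptions, not METR's own analysis.

```python
# Signed change in task completion time, in percent.
# Convention (an assumption for this sketch): negative = faster, positive = slower.
forecast_change = -24   # what developers predicted before the trial
perceived_change = -20  # what they believed afterward
measured_change = 19    # what the trial measured


def perception_gap(perceived: int, measured: int) -> int:
    """Percentage points between the believed and the measured change."""
    return measured - perceived


def same_direction(a: int, b: int) -> bool:
    """True when belief and measurement at least agree on speed-up vs slow-down."""
    return (a < 0) == (b < 0)


gap = perception_gap(perceived_change, measured_change)
agrees = same_direction(perceived_change, measured_change)

print(f"perception gap: {gap} percentage points")
print(f"direction agrees: {agrees}")
```

The gap comes out at 39 percentage points and the direction check is false: the developers were not slightly optimistic, they were wrong about which way the needle moved.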
To talk about the gap precisely, we need a ruler. The most useful one comes from Dan Shapiro — Glowforge's CEO and a Wharton research fellow — who in January 2026 published a framework he calls the five levels of vibe coding. The flippant name hides a serious idea: he modeled it on the federal government's levels of driving automation, because the human role recedes at each stage in the same structural way. Like driving, it's zero-indexed — so really six levels.
The point of the ladder isn't to score yourself a badge. It's to see, honestly, how far back the human has actually stepped — and to notice the two places almost everyone gets stuck.
Figure 2 — The anchor
The levels of agentic engineering (L0 → L5)
Walk it rung by rung, because the differences are easy to blur. Level 0, spicy autocomplete: you type, the AI finishes the line — original GitHub Copilot, a faster tab key. Level 1, the coding intern: you hand over one well-scoped task — write this function, refactor this module — and review all of it; the AI does the task, you keep the architecture and judgment. Level 2, the junior developer: the AI makes multi-file changes and builds features across modules, and you're reviewing more, but you still read every line. Shapiro estimates 90% of developers who call themselves "AI-native" are operating here — and the trap is that L2 feels like the summit when it's the foothills.
Level 3, the developer as manager, is where the relationship flips. You stop writing code; you direct the agent and review what it submits at the PR level. Almost everybody tops out here — not because the tools can't go further, but because of the psychological difficulty of letting go of the code. That wall is the review barrier, and it's the first of two places the industry piles up. Level 4, the developer as product manager: you write a specification, leave, and come back hours later to check whether the tests pass — the code is a black box you don't read, because your evaluation is complete enough that if it passes, you trust it. Level 5, the dark factory: specs in, working software out, lights off — no human writes the code and no human reviews it. Crossing from L3 to L4/L5 is the trust barrier, and it isn't bought with a better model. It's earned with better evaluation.
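One way to keep the rungs from blurring is to write the ladder down as data and ask where the human's role changes. The sketch below is one reading of the walk-through above, not Shapiro's own formalization: the `HumanRole` fields, the enum member names, and the choice to mark levels 0 through 2 as "human still writes code" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    SPICY_AUTOCOMPLETE = 0
    CODING_INTERN = 1
    JUNIOR_DEVELOPER = 2
    DEVELOPER_AS_MANAGER = 3
    DEVELOPER_AS_PRODUCT_MANAGER = 4
    DARK_FACTORY = 5


@dataclass(frozen=True)
class HumanRole:
    writes_code: bool  # does a human still type any of the code?
    reads_code: bool   # does a human still review code, at line or PR level?
    relies_on: str     # what the human actually trusts at this level


# Hypothetical encoding of the article's descriptions.
ROLES = {
    Level.SPICY_AUTOCOMPLETE: HumanRole(True, True, "typing it yourself"),
    Level.CODING_INTERN: HumanRole(True, True, "reviewing each delegated task"),
    Level.JUNIOR_DEVELOPER: HumanRole(True, True, "reading every line"),
    Level.DEVELOPER_AS_MANAGER: HumanRole(False, True, "PR-level review"),
    Level.DEVELOPER_AS_PRODUCT_MANAGER: HumanRole(False, False, "spec plus passing tests"),
    Level.DARK_FACTORY: HumanRole(False, False, "spec plus evaluation"),
}


def first_level_where(predicate) -> Level:
    """Lowest level whose human role satisfies the predicate."""
    return min(level for level, role in ROLES.items() if predicate(role))


stops_writing = first_level_where(lambda role: not role.writes_code)
stops_reading = first_level_where(lambda role: not role.reads_code)

print(stops_writing.name, stops_reading.name)
```

Under this encoding the human stops writing at Level 3 and stops reading at Level 4. Those are the two transitions the walk-through identifies as the places people stall.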
This framework is worth its weight because it gives honest language to a conversation drowning in marketing. When a vendor says their tool "writes code for you," they usually mean L1. When a startup says "agentic software development," they often mean L2 or L3. The distance between that language and a team genuinely operating at L5 is enormous — and closing it requires changes that go far beyond picking a tool.
So why are capable, experienced engineers measurably slower with AI? Not because they're bad at it. Because of where the friction lands. They spend time evaluating suggestions, correcting almost-right code, context-switching between their own mental model and the model's output, and debugging subtle errors in generated code that looked correct but wasn't. As one engineer put it, AI makes writing code cheaper but owning it more expensive. This is the J-curve: bolt an assistant onto a workflow built for humans and productivity dips before it climbs — sometimes for months.
Figure 3 — The adoption J-curve
Productivity dips before it climbs — and most orgs read the dip as failure
GitHub Copilot is the cleanest illustration: tens of millions of users, lab studies showing ~55% faster completion on isolated tasks — and, in production, larger pull requests, higher review costs, and new categories of security defects. The lesson generalizes. The organizations capturing real gains aren't the ones that installed a tool and ran a lunch-and-learn. They're the ones that rebuilt the whole development workflow around AI capabilities: how they write specs, how they review, what they expect from junior versus senior engineers, and how their CI/CD pipelines catch the new error classes AI introduces. End-to-end transformation is slow, expensive, and politically contentious — which is exactly why most companies stall at the bottom of the J and the frontier pulls away faster.
What does L5 actually look like when it's real and not a slide? The most thoroughly documented example is StrongDM's "software factory," which Simon Willison — among the most careful observers in developer tooling — called the most ambitious form of AI-assisted software development he'd seen yet. The detail matters, because it shows the machinery a dark factory actually needs.
The team is three engineers — Justin McCarthy, Jay Taylor, and Navan Chauhan — who started in July 2025, and pointedly, they build security software, the last place you'd expect anyone to delete human code review. Their two founding rules are blunt: code must not be written by humans, and code must not be reviewed by humans. Their inflection point was Claude 3.5 Sonnet in late 2024 — the moment long-horizon agentic coding started compounding correctness instead of compounding errors. The agent itself, an open-source project called Attractor, is a repo containing no code at all — just three markdown files specifying the software in meticulous detail. The output is unmistakably real: their AI context store, CXDB, runs to 16,000 lines of Rust, 9,500 of Go, and 6,700 of TypeScript, in production.
The genuinely new idea — the one that should reshape how you think about testing AI-built systems — is how they validate. They don't lean on traditional tests as the primary quality gate, and the reason is subtle. Tests usually live inside the codebase, which means the agent can read them — and an agent that can see the test will optimize to pass the test rather than to build correct software. It's teaching to the test, except it's the default behavior, not an integrity failure. StrongDM's answer is scenarios: behavioral specifications stored outside the codebase, which the agent never sees during development. They function as a machine-learning holdout set — the agent builds, the unseen scenarios judge, and the system can't game what it can't see.
Figure 4 — The novel mechanism
How a dark factory ships without human review
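The holdout idea can be sketched in a few lines. This is not StrongDM's implementation, which the article does not show. It is a toy under stated assumptions: the scenario file format, the `agent_build` stand-in, and the slug example are all hypothetical, and the isolation here is by convention only (the builder is simply never handed the holdout path). A real factory would need process or filesystem sandboxing to enforce it.

```python
import json
import tempfile
from pathlib import Path


def write_holdout(holdout_dir: Path) -> None:
    """Behavioral scenarios, stored outside the agent's workspace."""
    scenarios = [
        {"name": "plain ascii", "input": "Hello World", "expected": "hello-world"},
        {"name": "extra spaces", "input": "  a   b ", "expected": "a-b"},
        {"name": "punctuation", "input": "Q3 report!", "expected": "q3-report"},
    ]
    (holdout_dir / "scenarios.json").write_text(json.dumps(scenarios))


def agent_build(spec: str, workspace: Path):
    """Stand-in for the coding agent: it receives the spec and its own
    workspace, and is never given the holdout directory."""
    def slugify(text: str) -> str:
        cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
        return "-".join(cleaned.split())
    return slugify


def judge(candidate, holdout_dir: Path) -> dict:
    """Run the unseen scenarios against whatever the agent built."""
    scenarios = json.loads((holdout_dir / "scenarios.json").read_text())
    failed = [s["name"] for s in scenarios if candidate(s["input"]) != s["expected"]]
    return {"total": len(scenarios), "failed": failed}


with tempfile.TemporaryDirectory() as root:
    workspace = Path(root) / "workspace"
    holdout = Path(root) / "holdout"
    workspace.mkdir()
    holdout.mkdir()
    write_holdout(holdout)
    built = agent_build("Turn a title into a URL slug.", workspace)
    report = judge(built, holdout)

print(report)
```

The design choice that matters is the direction of information flow: the spec goes in to the builder, and the scenarios stay with the judge. How much of the judge's verdict is fed back to the agent is a separate decision, since detailed failure messages leak the holdout over time.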
One honest caveat, because it's the right kind of skepticism: if the agent that builds and the evaluator that judges share the same blind spot, no amount of testing diversity removes the risk that both miss the same thing. The holdout set mitigates the circularity; it doesn't fully dissolve it. Anyone designing for L4/L5 has to budget for that residual risk rather than pretend it's gone. StrongDM signals how seriously they take the compute side with a deliberately provocative benchmark: if you haven't spent on the order of $1,000 per engineer per day, your factory has room to run harder. That's not a flex — it's the cost of giving agents enough runway to actually converge, and it's still often cheaper than the humans it replaces.
This self-referential loop isn't unique to one team. The clearest proof is the tools building themselves: an Anthropic spokesperson put company-wide AI-authored code at 70–90%, with about 90% of Claude Code written by Claude Code itself, and Claude Code now generating roughly 4% of all public git commits. The role didn't disappear; it moved. As Boris Cherny, the engineer who built Claude Code, puts it, someone still has to prompt the models, talk to customers, coordinate with other teams, and decide what to build next. The contrast with the broad industry is the whole story: Microsoft reported around 30% AI-generated code, and a Science study found about 29% of U.S. GitHub Python functions are AI-written. The frontier is at roughly 90%; the field is at roughly 30%; the slowed-down middle is below its own baseline. Same tools, three different worlds.
Here's the part that's harder to see than the technology and probably matters more. Almost every structure in a software organization exists to solve a human coordination problem. Standups exist because people on one codebase must resync daily. Sprint planning exists because humans hold only so many tasks in working memory. Code review exists because humans make mistakes other humans can catch. QA exists because the people who built the thing can't evaluate it objectively. Every ceremony is a workaround for a human limitation — and when the human is no longer writing the code, those workarounds stop being neutral. They become friction.
This is why a three-person factory has no sprints, no standups, and no ticket board. They write specs and evaluate outcomes; that's the job. The entire coordination layer — the layer many managers spend the majority of their time maintaining — isn't trimmed as a cost measure. It's deleted, because it no longer serves a purpose. The center of gravity for the surviving roles shifts from coordination to articulation.
Figure 5 — The shift that's hard to see
The coordination layer collapses into articulation
If that sounds like a trivial reshuffle, you've never tried to write a specification detailed enough for an agent to implement correctly without a human filling the gaps — and you've certainly never tried to coach someone else to do it. The machine has no Slack channel to ask "did you mean X or Y?" It builds what you described; if the description was ambiguous, you get software that fills the gaps with statistically plausible guesses rather than customer-centric ones. The bottleneck has moved from implementation speed to spec quality — and spec quality is a function of how deeply you understand the system, the customer, and the problem. That depth has always been the scarcest resource in software. The dark factory doesn't reduce demand for it. It makes it the only thing that matters.
The machines stripped away the camouflage. Implementation complexity used to hide how few people were truly good at deciding what to build. Now we're going to find out.
Everything above assumes greenfield. Most of the software economy is not. The vast majority of enterprise software is brownfield: systems accreted over years, running in production, carrying real revenue — monoliths grown through a decade of feature additions, pipelines tuned to one team's quirks, config that lives in the heads of three people who remember why one environment variable is set the way it is. You cannot bolt a dark factory onto that, because the specification for it does not exist. The tests, if any, cover a fraction of the code; the rest runs on institutional memory. In a legacy system, the running code is the only complete specification of what the software does — because nobody ever wrote down the thousand implicit decisions that accumulated as patches and "temporary" workarounds that became permanent.
So for most organizations the path doesn't begin with "deploy an agent that writes code." It begins with the slow, deeply human work of reverse-engineering what the system actually does — which depends on the engineer who knows why the billing module has one edge case for Canadian customers, the architect who remembers which microservice was carved out under duress during an outage, and the product person who can say what the software does for real users versus what the spec claims. Domain expertise, ruthless honesty, customer understanding, systems thinking: exactly the human capabilities that matter more in the dark-factory era, not less. The migration looks roughly like this.
Figure 6 — The realistic enterprise path
How brownfield orgs actually climb
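One common first step in that reverse-engineering work, which the article does not name, is a characterization test: record what the running code does today, right or wrong, and treat the recording as the first draft of the missing specification. The sketch below assumes a made-up billing function with a Canadian edge case; every name and number in it is hypothetical.

```python
import json


def legacy_shipping_fee(country: str, subtotal_cents: int) -> int:
    """Stand-in for undocumented legacy code (hypothetical rules)."""
    fee = 0 if subtotal_cents >= 5000 else 799
    if country == "CA":
        fee += 250  # nobody wrote down why; the code is the only record
    return fee


def record_behavior(func, cases):
    """Capture current behavior as plain data that can be stored and reviewed."""
    return [{"args": list(args), "result": func(*args)} for args in cases]


def check_against(snapshot, func):
    """Return every recorded case where a candidate departs from the snapshot."""
    return [row for row in snapshot if func(*row["args"]) != row["result"]]


cases = [("US", 1000), ("US", 5000), ("CA", 1000), ("CA", 5000)]
# Round-trip through JSON to show the snapshot survives being written to disk.
snapshot = json.loads(json.dumps(record_behavior(legacy_shipping_fee, cases)))


def rewritten_fee(country: str, subtotal_cents: int) -> int:
    """A plausible clean rewrite that silently drops the Canadian rule."""
    return 0 if subtotal_cents >= 5000 else 799


drift = check_against(snapshot, rewritten_fee)
print([row["args"] for row in drift])
```

The snapshot flags both Canadian cases. It cannot say whether the surcharge is a bug or a requirement; that still takes the engineer who remembers why it is there. What it does is turn an implicit decision into a visible one before an agent rewrites it away.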
There's a human cost we have to name, not wave away. The classic software career is an apprenticeship in disguise: juniors learn by writing simple features and fixing small bugs; seniors review and mentor; over five to seven years a junior becomes a senior through accumulated reps. AI breaks that model precisely at the bottom. If agents handle the simple features and small fixes, and review code faster and more thoroughly than a senior doing a PR pass, the rungs juniors used to climb are gone — and the data shows it's already happening.
The implications run past the people who can't find a first job, bad as that is. A Harvard analysis found junior employment falling 9–10% within six quarters of a company adopting generative AI while senior employment barely moved. The ladder is hollowing from underneath: seniors at the top, AI at the bottom, a thinning middle where learning used to happen. And here's the twist: we need more excellent engineers than ever, not fewer. The bar is simply rising toward the skills that were always hardest to build. The junior of 2026 needs the systems-design judgment once expected of a mid-level engineer in 2020, not because entry work got harder, but because entry work got automated and what remains demands deeper judgment.
So the advice splits cleanly. If you're early-career: lean all the way into AI, lean into being a generalist, and demonstrate that you can pick up an unfamiliar problem in almost any domain and work it through with AI in minutes, because hiring is shifting toward generalists who understand systems, users, and business constraints over narrow specialists. Some organizations are even adopting a medical-residency model: simulated environments where juniors learn by directing and evaluating AI output, building judgment about what's subtly wrong. It's not the same as learning from a blank editor, but it may be better training for a job that is now about directing and evaluating rather than typing. If you're senior or a manager: your hardest new task is coaching that judgment in others, and your value is migrating from coordination to articulation whether or not your title changes.
It's tempting to end on dread, but the structural history argues otherwise — and it's worth being precise about why, so it doesn't read as a comfortable dodge. Every time the cost of computing collapsed — mainframes to PCs, PCs to cloud, cloud to serverless — the total amount of software the world produced didn't hold steady. It exploded, because categories that were economically impossible at the old cost structure suddenly became viable, then ubiquitous. The cloud didn't just make existing software cheaper to run; it created SaaS, mobile, streaming, and real-time analytics that couldn't have existed when shipping meant buying a rack of servers.
The same dynamic is arriving now, at a larger scale. Every regional hospital, mid-market manufacturer, and family logistics company needs software they currently can't afford to build — a custom inventory system that traditionally ran into six or seven figures and a year of work. They make do with spreadsheets. Drop the cost of production by an order of magnitude and that unmet demand becomes addressable. The constraint doesn't vanish; it moves — from "can we build it" to "should we build it, and for whom." And "should we" has always been the harder, more human question. That's also why the AI-native org looks the way it does: Cursor runs in the millions of revenue per employee and Midjourney near $5M, several times the SaaS norm — small groups exceptional at understanding users, translating that into clean specs, and directing systems that implement.
The dark factory doesn't replace the great product thinker. It turns one with five engineers into one with unlimited engineering capacity.
So hold both truths at once, because being honest requires it. The frontier is farther ahead than almost anyone wants to admit — real teams shipping real production software with no human writing or reviewing code, improving with every model generation. And the middle is farther behind than the frontier likes to mention — stuck at L2, measurably slower, running organizations designed for a world where humans do the implementing. The distance between them is not a technology gap. It's a people gap, a culture gap, a willingness-to-change gap that no vendor can close for you. The organizations that cross it won't be the ones that bought the best tool. They'll be the ones that did the slow, unglamorous work of documenting what their systems do and rebuilding their people around judgment instead of coordination — and were honest enough to admit it would take longer than they wanted, because people change slowly.
The dark factory does not need more engineers. It desperately needs better ones — people who can think clearly about what should exist, describe it precisely enough that machines can build it, and judge whether what got built actually serves the humans it was for. That was always the hard part of software. We just used to let implementation complexity hide how few of us were good at it. The camouflage is gone. Time to find out how good we are.
The part about experienced devs getting slower with AI and never noticing is uncomfortably accurate. We had a senior who was sure he was faster, and his cycle time went up for two months because he was babysitting and re-reading every diff. Felt productive, measured slower. Nobody wants to hear that about themselves.
Do you have the actual numbers, or is this the feeling that it was slower? I ask because there is one widely shared study on this and it gets quoted way past what it measured. Your internal cycle time data would honestly be more interesting than the study if you can share the shape of it.
Fair hit. It was PR cycle time from our own dashboards, n of one engineer, no control. So an anecdote with a graph attached, not a study. I would not publish it. It was enough to make us look closer, not enough to prove the mechanism.
The three person team shipping software no human reviews is basically my whole company, minus two people. It is real, but the unglamorous truth is most of the work was building the rejection harness so I can trust it overnight. The shipping itself is easy now. Getting to the point where I trusted it took months.
The gap being a skill gap and not a tool gap is the line that will annoy the most people and is also the most correct thing here. Same tools, wildly different outcomes, two desks apart. No purchase order closes that.