Generation is solved. The bottleneck is judgment — and the specific, learnable, scalable form of judgment is saying no to confident AI output, and knowing exactly why. Most teams let every one of those noes fall on the floor.
Ask someone how good they are at AI and they'll tell you about their prompts. Wrong question. The better tell is how often they say no — not no to using AI, but no to its output: this framing is wrong, this reasoning is sloppy, this analysis sounds confident and would collapse the moment it met someone who actually knows the domain. The people who are genuinely good at this reject far more than they accept. They send it back, they say why, and then — after the model crawls from 90% to 95% — they say no again, because their bar is higher than the model knows.
This isn't about superpowered prompting. You can prompt beautifully and still reject most of what comes back, because rejection draws on something prompting doesn't: domain taste. And here's the part nobody is building for — every one of those rejections is a piece of knowledge that did not exist until you articulated it, and almost all of them evaporate into a Slack thread the moment they're spoken. So the same deck comes back tomorrow with the same flaw, and you fight the same fight again.
Figure 1 — The tell
"Good at AI" looks like a high reject rate
To see why rejection is the bottleneck, look at how good generation has gotten. OpenAI built GDPval, the most rigorous measurement we have of AI against real knowledge work: 1,320 tasks across 44 occupations in the nine largest sectors of the U.S. economy, each task built and graded by professionals averaging 14 years of experience, each based on a real deliverable that took a human about seven hours to produce. The grading is blind and head-to-head: experts compare an AI deliverable against a human one without knowing which is which.
Sit with what that means. On well-specified knowledge work, a model already ties or beats seasoned professionals roughly half the time, near-instantly, for a rounding error on the cost. Generation — the thing every AI course is built around, the prompting, the workflow design, the model selection — is no longer the scarce input. It's effectively solved. Which reframes the entire question. If AI matches your best people's output that often, the value was never in producing the draft. It's in the two things that happen next: knowing what to do with the half that looks right but won't survive production, and catching the part where the model confidently whiffs. Both require the same move — someone looks at the output and knows. And knowing, in practice, sounds like the word "no."
These rejections aren't trivial corrections. They're knowledge-creation events, and they span every kind of knowledge work — which is why the examples below deliberately aren't all engineering:
Each of those is a durable rule about quality — and right now each one lives in an email and dies there. Nobody captures it. Nobody compounds it. That's the largest structural gap in the AI tooling ecosystem: organizations generate thousands of expert rejections at the grassroots, and almost every single one falls on the floor.
"Taste" gets waved around as if it were one mystical attribute. It isn't. A skilled rejection decomposes into three distinct competencies that happen in sequence, and naming them is what makes the whole thing learnable, teachable, and eventually scalable.
Figure 2 — The anchor
Recognition → Articulation → Encoding
Recognition is the part that can't be shortcut, because it's the product of years of practice. A junior analyst won't catch a flawed regulatory assumption; a loan officer who hasn't seen enough deals won't spot the covenant-logic error. The person who has reviewed 2,000 deals and simply feels when something is off is becoming the most important person in the building, not despite AI but because of it, as output floods every desk. And recognition is the dimension AI most amplifies: a domain expert with strong recognition and good tooling can evaluate ten times the output they used to. But that leverage has a hard edge, which deserves its own picture.
Figure 3 — The edge of the leverage
Inside your expertise, AI multiplies expertise. Outside, it multiplies confidence.
"This isn't right" is a rejection. "This isn't right because you're treating all these requirements identically, when the spec needs to separate monitoring triggers" is a constraint. That difference — from a grunt of disapproval to a portable rule — is the difference between taste that stays in one skull and taste the team can use. It's a learnable skill, and almost nobody teaches it. There's a lovely irony here: GDPval itself was built through articulated rejection. Every one of its tasks went through five rounds of expert review, and every round was a rejection event — an expert saying "this isn't representative enough" or "this isn't clear enough to grade." That iterative refusal is what made the benchmark trustworthy. The expert taste didn't just evaluate the AI; it built the evaluation infrastructure.
This is where it all breaks today. Someone articulates a sharp constraint, it lives in an email, and next quarter a different team rediscovers the same lesson from scratch while you burn hours re-litigating it with the model. Encoding is the practice of making the constraint persist — and it connects directly to how AI actually improves. Andrej Karpathy's framing is the cleanest available: traditional computers automate what you can specify; LLMs automate what you can verify. Verification infrastructure — the test suite, the acceptance criteria, the quality gates, the business rules that actually hold — doesn't appear by magic. It is, quite literally, encoded rejections: outputs someone said no to with enough precision that the no could be made permanent. Encoding is just extending that discipline from your CI pipeline to the daily act of rejection every good practitioner already performs.
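To make "encoded rejections" concrete, here is a minimal sketch of a single rejection turned into a reusable check. Everything in it is an assumption for illustration: the `Constraint` class, the `review` function, and the keyword test standing in for a real rule are hypothetical, not anything the tools mentioned above provide.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Constraint:
    """One encoded rejection: a reusable rule plus the reason behind it."""
    name: str
    reason: str                   # the articulated "why" from the expert
    check: Callable[[str], bool]  # returns True when the output passes


def review(output: str, constraints: list[Constraint]) -> list[str]:
    """Return the reason for every constraint the output violates."""
    return [c.reason for c in constraints if not c.check(output)]


# Hypothetical rule, loosely modeled on the monitoring-trigger example.
# A real check would be far richer than a keyword match.
separate_triggers = Constraint(
    name="separate-monitoring-triggers",
    reason="Monitoring triggers must be specified separately from other requirements.",
    check=lambda text: "monitoring triggers" in text.lower(),
)

draft = "All requirements are handled identically."
violations = review(draft, [separate_triggers])
print(violations)
```

The point of the shape is that the reason travels with the check, so a failed gate tells the next person why the bar exists and not only that it was missed.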
Once you encode rejections instead of dropping them, something changes character. You're no longer cleaning up one output at a time — you're building a flywheel. AI generates a provocation; the expert rejects it; the rejection gets encoded as a constraint; the constraint feeds back so the next provocation clears a higher bar. Each turn, the taste bar rises and the expert's hours buy more durable constraints. You stop scaling experts and start scaling the encoded residue of expert judgment.
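One turn of that loop can be sketched in a few lines, assuming the simplest possible feedback path: encoded constraints are prepended to the next prompt. The `ConstraintLibrary` class and `build_prompt` helper are hypothetical names, and no model is called here; the sketch only shows how the bar the next generation sees grows with each encoded no.

```python
from dataclasses import dataclass, field


@dataclass
class ConstraintLibrary:
    """Accumulates articulated rejections so they outlive one conversation."""
    rules: list[str] = field(default_factory=list)

    def encode(self, reason: str) -> None:
        """Persist an articulated rejection, skipping exact duplicates."""
        if reason not in self.rules:
            self.rules.append(reason)

    def as_prompt_preamble(self) -> str:
        """Render the library so the next generation sees the current bar."""
        if not self.rules:
            return ""
        lines = [f"{i}. {rule}" for i, rule in enumerate(self.rules, start=1)]
        return "Known constraints:\n" + "\n".join(lines)


def build_prompt(task: str, library: ConstraintLibrary) -> str:
    """Combine the accumulated constraints with the new task."""
    preamble = library.as_prompt_preamble()
    return f"{preamble}\n\nTask: {task}" if preamble else f"Task: {task}"


library = ConstraintLibrary()
library.encode("Separate monitoring triggers from other requirements.")
library.encode("Separate monitoring triggers from other requirements.")  # duplicate, ignored
library.encode("Cite a source for every market-size figure.")

prompt = build_prompt("Draft the compliance spec.", library)
print(prompt)
```

Prompt preambles are only one possible feedback path; the same library could equally feed automated checks or a review checklist.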
Figure 4 — The compounding loop
How noes turn into a constraint library
Karpathy's verifiability thesis has a corollary that should genuinely keep executives up at night, and it's the single most important idea in this piece. The frontier of AI value is identical to the frontier of your organization's taste. Where your capacity to verify quality reaches, AI can safely create value. Where it doesn't reach, AI doesn't stop — it keeps generating, and now it's generating risk: not the loud kind, but the silent, compounding kind, where an organization produces more and more while understanding less and less, until the output still looks fine but nobody remembers where the bar was.
Figure 5 — Karpathy's corollary, drawn
AI value and AI risk share one border: your taste
So your anti-slop strategy isn't more lectures or better prompts. It's developing, institutionalizing, and eventually automating the skill of rejection — so taste scales past the one head it lives in.
This isn't speculative. The firms that have been quietly encoding taste for years already dominate their categories — they just did it by hand, before AI made the cycle fast. Epic Systems didn't win healthcare on superior technology. It won by spending decades sending its developers onsite to shadow doctors, watch workflows, and absorb the clinical constraints that no requirements document captured — encoding that judgment, hospital by hospital, into a platform. The result, decades later, is unambiguous: Epic anchors care for hundreds of healthcare organizations and more than 280 million patient records, with the largest share of the U.S. acute-care market, and switching costs so structural they can run into the billions. The moat was never the software. It was the encoded judgment about what the software had to get right, built rejection by rejection, workflow by workflow. Bloomberg did the same in financial data; every vertical-SaaS company that genuinely owns its niche is running some version of this play.
The lesson for anyone anxious about software being commoditized: a system built out of encoded taste at scale, to the point where it becomes structural to how its customers operate, is extraordinarily hard to rip out. And the thing that's new is speed — AI compresses the encoding cycle from decades to weeks. The provocation is instant; the rejection is yours; the library compounds.
There's a second payoff, and it lands on the exact problem the previous sheet in this series ended on: the collapsing junior pipeline. Recognition, as argued above, is built only through years of reps, which is precisely what juniors no longer get now that the simple work is automated and senior-junior mixing has thinned. A living constraint library (the encoded taste of your most senior people, made queryable) short-circuits part of that. A junior can check their work against the accumulated bar and learn, in seconds, what used to require a partner looking over their shoulder for years.
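"Queryable" can mean something as modest as a tag lookup. The sketch below assumes each rule carries a few topic tags and the name of the senior who wrote it; the `Rule` class, the `lookup` function, and the two sample rules are all hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    text: str
    author: str
    tags: frozenset[str]


def lookup(library: list[Rule], *tags: str) -> list[Rule]:
    """Return every rule that shares at least one tag with the query."""
    wanted = {tag.lower() for tag in tags}
    return [rule for rule in library if wanted & rule.tags]


# Illustrative entries only; a real library would be filled by real rejections.
library = [
    Rule(
        "State the cure period before evaluating any covenant breach.",
        "senior-credit",
        frozenset({"covenant", "credit"}),
    ),
    Rule(
        "Cite a source for every market-size figure.",
        "senior-strategy",
        frozenset({"market", "deck"}),
    ),
]

for rule in lookup(library, "Covenant"):
    print(f"{rule.text} (from {rule.author})")
```

Keeping the author on each rule preserves a thread back to the person a junior can ask when the rule alone isn't enough.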
Figure 6 — The second dividend
A constraint library re-mixes juniors and seniors
One design constraint matters more than the rest, and it's why most attempts at this fail: the capture cannot be a separate tool, a spreadsheet, a dashboard, or a new pane of glass. People won't context-switch to feed it — attention is the scarcest resource any of us has. The encoding has to happen where the work already happens, as a near-invisible side effect of the rejection you were going to perform anyway. Get that wrong and the library starves; get it right and it fills itself.
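As a rough illustration of capture as a side effect, the sketch below assumes the rejection already passes through some function in the existing workflow, and that function writes one JSON line as it goes. The `reject` and `most_common_reasons` names, the record fields, and the sample reasons are hypothetical; `StringIO` stands in for whatever file or queue a real setup would use.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from io import StringIO
from typing import TextIO


def reject(output_id: str, reason: str, log: TextIO) -> dict:
    """Record a rejection as one JSON line at the moment it is made."""
    record = {
        "output_id": output_id,
        "reason": reason,
        "rejected_at": datetime.now(timezone.utc).isoformat(),
    }
    log.write(json.dumps(record) + "\n")
    return record


def most_common_reasons(log_text: str) -> list[tuple[str, int]]:
    """Rank rejection reasons by how often they recur."""
    reasons = [
        json.loads(line)["reason"]
        for line in log_text.splitlines()
        if line.strip()
    ]
    return Counter(reasons).most_common()


log = StringIO()  # stand-in for an append-only file
reject("deck-14", "market size cited without a source", log)
reject("memo-02", "covenant logic ignores the cure period", log)
reject("deck-15", "market size cited without a source", log)

ranked = most_common_reasons(log.getvalue())
print(ranked)
```

The ranking is where the log starts to pay off: a reason that keeps recurring is the first candidate to promote into a permanent constraint.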
Your competitive moat is not which model vendor you pick — models are commoditizing. It's the depth and durability of your organization's encoded taste. Audit it: where are your domain experts, and are their rejections being captured or evaporating? Start treating encoded domain judgment as an asset class, because it is one.
Create space for articulation. When someone rejects AI output, push them to explain why in a way others can reuse — and then socialize it. A team that articulates its rejections builds shared quality that survives projects, personnel changes, and tool migrations. A team that silently fixes output one piece at a time isn't growing at all.
Your most valuable development isn't the newest tool — tools change. It's deepening your ability to recognize when something's off, practicing the articulation of exactly what's wrong and how to fix it, and helping stand up a system where that taste can scale beyond you.
So much is becoming commodity. The draft, the deck, the first-pass analysis — all cheap now, all fast, all roughly as good as a seasoned professional about half the time. What stays scarce, and is quietly becoming the whole game, is the ability to look at confident output and say this is excellent — or, far more often, this is not, and here is precisely why. We will not run out of things worth holding to a higher bar. The job now is to make that judgment count more than once.
Generation is solved. Rejection is the skill. And a rejection you encode is the only kind that ever stops costing you twice.
Rejection as infrastructure is exactly right and almost nobody instruments it. We started logging every critic veto with the reason, and within a month that log was the most valuable dataset we had. It is basically a list of the ways our agent is wrong, ranked by frequency. You cannot buy that, you have to catch it.
Lurker here. This is the first thing in months that made me want to actually change something at work. We throw all our rejections away. Starting that log Monday.
I would add one careful caveat to the central claim. Treating rejection as the scalable skill assumes the rejector has calibrated judgement, otherwise you are scaling a biased filter and calling it taste. The literature on human evaluation is fairly clear that untrained annotators disagree a great deal. The skill the piece describes is real, but it is itself the thing that needs evaluating, not a fixed reference point.
How do you practice saying no when you are not sure you are right? That's my honest blocker. The model sounds so confident, and I'm junior enough that I second-guess myself and just accept it. I don't have the taste yet, and I don't know how you build it without shipping a few mistakes first.
You build it by writing down why you rejected something before you check if you were right, then comparing. The note is the practice. Confidence in the output is not evidence, it is just tone. Over a few months your written reasons get sharper and you stop being bullied by a fluent paragraph. Being unsure and writing it down anyway is the whole drill.