Five learnable skills and two frameworks that separate the people writing their own ticket from the people sending out their hundredth résumé.
There is a strange contradiction in the AI job market right now. Employers swear they cannot fill agent roles: one widely cited industry survey puts the ratio at roughly 3.2 open roles for every qualified candidate, with senior agent positions sitting empty for nearly 140 days. Meanwhile, thousands of capable people fire off applications into a void and conclude the whole thing is a mirage.
Both groups are telling the truth. The market has split into a K — generic knowledge work flattening into commodity on one arm, agent-native work spiking on the other — and the two arms barely touch.
Figure 1 — The shape of the split
Two markets moving in opposite directions
Here is the thing nobody applying to the lower arm wants to hear: "I can use ChatGPT" is not the qualification. Talking to a model is table stakes. The upper arm is hiring for something narrower and far more valuable — the ability to program an agent: to specify it, test it, decompose its work, diagnose it when it breaks, and feed it the right context so it holds up in production.
That is a real engineering discipline, and the good news is that it decomposes into a small number of learnable skills. I've pulled five of them — plus two frameworks that turn the skills into decisions — out of the patterns that keep recurring across job postings, Anthropic's own engineering write-ups, and the failure modes everyone hits in production. None of this requires a CS degree. Most of it transfers from work you've already done.
…plus FRAMEWORK B — The Trust Boundary, where token economics earns its keep.
The five skills come in the order you actually learn them. You specify something, then you immediately need to know whether you got what you asked for, then you discover the task is too big for one agent, then you watch it fail in interesting ways, then you realize the real bottleneck was the information you fed it. So they share a spine — a loop that every agent program runs whether you drew it or not.
Figure 2 — The spine
Specify → build → evaluate, with the gate that matters
We learned to give instructions by giving them to humans. Humans read between the lines, infer intent, and quietly fix our sloppy instructions. Agents do almost none of that. An agent takes your specification and runs, and where you left a blank, it fills it in with something plausible that is not what you meant. One of the clearest signals that general intelligence hasn't arrived is exactly this: agents are bad at filling in blanks the way you intended, so your clarity becomes the program's correctness.
People still call this "prompting." In serious postings it shows up as specification precision or clarity of intent, and the gap between the two phrasings is the gap between a hobbyist and a hire. Watch what happens to a single request when you specify it properly:
Same intent. Wildly different programs. The right column tells the agent what's in scope, what "escalate" means in measurable terms, and what to record so you can debug it later. That is the 2026 bar for prompting, and if you're a technical writer, a lawyer, or a QA engineer, you have written precisely this kind of unambiguous instruction your whole career. The distance to cross is shorter than it looks.
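One way to practice this is to write the specification as data before writing it as prose, so an empty section is visible instead of silently left for the agent to fill. The sketch below is illustrative only: the `AgentSpec` class, its field names, and the refund-support values are all assumptions made up for this example, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSpec:
    """Hypothetical container for the three things a precise spec pins down."""
    goal: str
    in_scope: list[str] = field(default_factory=list)
    escalate_when: list[str] = field(default_factory=list)  # measurable triggers
    log_fields: list[str] = field(default_factory=list)     # what to record per run

    def blanks(self) -> list[str]:
        """Return the names of sections left empty (blanks the agent would fill)."""
        sections = {
            "in_scope": self.in_scope,
            "escalate_when": self.escalate_when,
            "log_fields": self.log_fields,
        }
        return [name for name, value in sections.items() if not value]

    def render(self) -> str:
        """Render the spec as plain-text instructions; refuse if anything is blank."""
        missing = self.blanks()
        if missing:
            raise ValueError(f"unspecified sections: {', '.join(missing)}")
        lines = [f"Goal: {self.goal}", "In scope:"]
        lines += [f"- {item}" for item in self.in_scope]
        lines.append("Escalate to a human when:")
        lines += [f"- {rule}" for rule in self.escalate_when]
        lines.append("Record for every run:")
        lines += [f"- {name}" for name in self.log_fields]
        return "\n".join(lines)


# Same intent, two specs. All values are invented for illustration.
vague = AgentSpec(goal="Handle customer refund requests")
precise = AgentSpec(
    goal="Handle customer refund requests",
    in_scope=["orders placed in the last 30 days", "refunds up to 100 USD"],
    escalate_when=["refund amount exceeds 100 USD", "customer asks for a human"],
    log_fields=["order_id", "decision", "rule_that_fired"],
)
```

Calling `vague.render()` raises an error listing the three empty sections, while `precise.render()` produces instructions with a scope, measurable escalation triggers, and a logging contract.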
The moment you specify something, you inherit a new problem: did you actually get it? This skill — evaluation and quality judgment — is the single most-cited capability across agent job postings, and it's the substance hiding under all the vague "taste" discourse. Strip away the ego-stroking and "taste" just means error detection with fluency.
It matters because AI fails differently than people do. When humans are wrong, we stumble — we hedge, we trail off, we give the tells you've spent a lifetime learning to read. AI is often confidently, fluently wrong. It produces clean prose, correct-looking headers, a tidy structure — and an answer that's broken underneath. The core discipline is refusing to read fluency as competence.
Semantic correctness is when the model says something that sounds right. Functional correctness is when it says something that is right. Insist on the second.
A model can tell a customer "this is the perfect credit card for you" in flawless English. If it's the wrong card, the sentence is a disaster wearing a suit. The senior skill is holding output to functional correctness and building the machinery that measures it: eval harnesses, automated checks, edge-case probes. Anthropic's own framing of what makes a good eval is the most useful litmus test I've found — a task is well-specified when two independent engineers looking at the same output would agree on pass or fail. If they wouldn't, your eval is taste; if they would, it's a skill anyone can learn.
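A minimal sketch of what "machinery that measures it" can look like, assuming a task where the right answer can be written down in advance. The profiles, card names, and the `fluent_but_wrong` toy agent are hypothetical; a real harness would call an actual model and would likely need richer checks than exact match.

```python
from typing import Callable

# Hypothetical stand-in for an agent call: maps a customer profile to a card name.
Agent = Callable[[dict], str]

# Each case pairs an input with the answer a domain expert would accept.
# Two independent reviewers should agree on pass or fail for every case.
CASES = [
    {"profile": {"travels_often": True, "carries_balance": False}, "expected": "travel_rewards"},
    {"profile": {"travels_often": False, "carries_balance": True}, "expected": "low_apr"},
    # Edge case: both signals present; the carried balance should win.
    {"profile": {"travels_often": True, "carries_balance": True}, "expected": "low_apr"},
]


def run_evals(agent: Agent, cases: list[dict]) -> dict:
    """Score an agent on functional correctness, not on how well it reads."""
    failures = []
    for case in cases:
        answer = agent(case["profile"])
        if answer != case["expected"]:
            failures.append(
                {"profile": case["profile"], "got": answer, "expected": case["expected"]}
            )
    return {"passed": len(cases) - len(failures), "total": len(cases), "failures": failures}


def fluent_but_wrong(profile: dict) -> str:
    """Toy agent that always answers confidently and ignores carried balances."""
    return "travel_rewards" if profile["travels_often"] else "low_apr"


report = run_evals(fluent_but_wrong, CASES)
```

Here `report` shows two of three cases passing, and the single failure is the edge case. That is the pattern worth noticing: the happy path looks fine, and only the probe at the edge exposes the wrong card.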
Start reviewing every AI output as if your name is on it. Not "is this plausible?" but "is this right, including the edges?" Editors and auditors already do this for a living — they're just applying it to a new medium. It's the cheapest way to build the most valuable skill on this list.
Eventually the task is too big for one shot, and you reach for multiple agents. People treat this as a chasm — "I can run Claude Code, but multi-agent makes me go white at the roots." It's more approachable than that. At root, working with many agents is the skill of decomposing work and delegating it: breaking a goal into clean, handoff-able chunks. If you've ever split a project into workstreams, the bones transfer.
But do not mistake it for human project management. You can hand six humans a vaguely-scoped assignment and they'll figure it out; people are elastic. Agents are not. They need defined guardrails, explicit goals, and a clear statement of how the system should run. The current best practice is boring and effective: a planner agent that holds the task list and coordinates sub-agents that do the work — the orchestrator-worker pattern. Anthropic reported that this structure, with a lead agent over parallel sub-agents, outperformed a single agent by 90.2% on their internal research evaluation. The catch, which the next framework is built around: that architecture costs roughly fifteen times the tokens of a single chat, so you don't reach for it casually.
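The orchestrator-worker shape can be sketched in a few lines, assuming the sub-tasks are already defined and independent. Everything here is a stand-in: `worker` returns a canned string where a real sub-agent would call a model with its own context, and `estimated_tokens` simply applies the roughly fifteen-fold figure quoted above as a budgeting estimate.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(subtask: str) -> str:
    """Hypothetical sub-agent; a real one would call a model with its own context."""
    return f"findings for {subtask!r}"


def orchestrate(goal: str, subtasks: list[str]) -> dict:
    """Planner holds the task list, fans out to workers, and merges results in order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(worker, subtasks))  # map preserves input order
    return {"goal": goal, "results": dict(zip(subtasks, results))}


def estimated_tokens(chat_tokens: int, multiplier: float = 15.0) -> float:
    """Rough budget check using the ~15x figure; an estimate, not a measurement."""
    return chat_tokens * multiplier


plan = orchestrate("compare vendors", ["pricing", "security", "support"])
budget = estimated_tokens(2_000)
```

The sketch skips the hard part on purpose: in a real system the planner decides the sub-tasks itself, which is why the explicit goals and guardrails matter so much.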
Which raises the question this skill really turns on: is this task even shaped right for the harness I have? That's not a skill bullet — it's a framework. So let's make it one.
Before you decompose anything, you decide which machine you're decomposing for. A single-threaded agent is an engineer-in-a-box: capable, but you must hand it tasks sized to fit one context window and one line of attention. A planner-plus-sub-agents harness can absorb a larger goal — but only if you've defined the sub-tasks and their relationships clearly enough that the planner can make good calls. Pick the wrong harness for the task and you either starve a powerful system or overload a simple one.
Figure 3 — Framework A
Match the task to the harness, then size it to fit
Use this as a literal gate. Most tasks belong on the left, and the left is where you should start: it's cheaper, more predictable, and far easier to debug. Cross to the right only when the work is genuinely wide — independent threads that exceed a single context window — because you pay for that parallelism in tokens whether or not the task needed it.
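Read as code, the gate is a two-question check. This is a sketch under stated assumptions: the `Task` fields, the 200,000-token default window, and the wording of the returned labels are all hypothetical, and the third branch (too big but not wide) is an interpretation of "size it to fit" and not something the framework spells out.

```python
from dataclasses import dataclass


@dataclass
class Task:
    estimated_tokens: int     # rough size of everything the agent must read and write
    independent_threads: int  # parts that can proceed without seeing each other


def choose_harness(task: Task, context_window: int = 200_000) -> str:
    """Default to a single agent; go multi-agent only for wide, oversized work."""
    fits_one_window = task.estimated_tokens <= context_window
    genuinely_wide = task.independent_threads > 1
    if fits_one_window:
        return "single agent"
    if genuinely_wide:
        return "planner + sub-agents"
    # Too big for one window but not parallel: resize the task, not the harness.
    return "single agent, task split into sequential steps"


small = choose_harness(Task(estimated_tokens=40_000, independent_threads=3))
wide = choose_harness(Task(estimated_tokens=900_000, independent_threads=5))
long_chain = choose_harness(Task(estimated_tokens=900_000, independent_threads=1))
```

Note that the first example stays on a single agent even though it has three independent threads: parallelism alone is not a reason to pay for the larger harness.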
Once you've assembled real systems, you discover they fail in specific, recurring ways. This skill is under-taught and disproportionately valuable, because employers who've tried to build agents know how many ways there are for them to break. There are essentially six, and being able to look at a bad run and say "that's specification drift, not context degradation" is the difference between fixing it in an hour and flailing for a week.
Figure 4 — Skill 04, made operational
The failure-mode triage tree
If you're an SRE, a risk manager, or an operations lead, you already think in failure modes; this is a lateral move, not a leap. And if you don't, the work has a puzzle-like pull to it: there is a missing piece in here, and finding it is satisfying. The point is that "the agent is flaky" is not a diagnosis. Naming the failure is where the fix begins.
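A small habit that makes naming stick is to label every bad run with a failure mode and count the labels. The sketch below covers only the four modes this article names in its text; it does not attempt to list all six, and the run IDs and labels are invented for illustration.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    """Only the modes named in the article text; not the complete set of six."""
    SPECIFICATION_DRIFT = "specification drift"
    CONTEXT_DEGRADATION = "context degradation"
    SYCOPHANTIC_CONFIRMATION = "sycophantic confirmation"
    SILENT_FAILURE = "silent failure"


def triage_summary(labelled_runs: list[tuple[str, FailureMode]]) -> list[tuple[str, int]]:
    """Count diagnoses across bad runs, most frequent first."""
    return Counter(mode.value for _, mode in labelled_runs).most_common()


# Hypothetical review of three bad runs.
labelled = [
    ("run-1", FailureMode.SPECIFICATION_DRIFT),
    ("run-2", FailureMode.SPECIFICATION_DRIFT),
    ("run-3", FailureMode.CONTEXT_DEGRADATION),
]
summary = triage_summary(labelled)
```

A summary dominated by one label points at where to spend the next hour, which "flaky" never does.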
Here's the skill companies will pay almost anything for, because getting it right is what lets them build not one agent but dozens. In 2024 the job was "get the right documents into the prompt." In 2026 it's context architecture: designing the whole information environment an agent draws from, on demand, at scale.
Anthropic frames context as a finite resource with diminishing returns — as the token count climbs, the model's ability to accurately recall information from its own context actually decreases, a phenomenon they call context rot. So the discipline isn't "stuff in everything relevant." It's curating the smallest set of high-signal tokens that maximize the odds of the outcome you want. You're a librarian for machines.
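As a rough illustration of curating under a budget, the sketch below greedily keeps the chunks with the best signal per token. It is a simplification under stated assumptions: the `signal` scores are hypothetical numbers you would have to estimate yourself, the token budget stands in for "smallest set," and a greedy pass is not guaranteed to find the best possible selection.

```python
def curate(chunks: list[dict], budget: int) -> list[dict]:
    """Greedily keep the highest signal-per-token chunks that fit the budget."""
    ranked = sorted(chunks, key=lambda c: c["signal"] / c["tokens"], reverse=True)
    kept, used = [], 0
    for chunk in ranked:
        if used + chunk["tokens"] <= budget:
            kept.append(chunk)
            used += chunk["tokens"]
    return kept


# Hypothetical candidate context for a refund-handling agent.
chunks = [
    {"name": "refund_policy", "tokens": 400, "signal": 0.9},
    {"name": "full_chat_history", "tokens": 6000, "signal": 0.5},
    {"name": "order_record", "tokens": 150, "signal": 0.8},
    {"name": "marketing_copy", "tokens": 900, "signal": 0.1},
]
selected = curate(chunks, budget=1000)
names = [chunk["name"] for chunk in selected]
```

With a 1,000-token budget, only the order record and the refund policy survive. The full chat history is moderately relevant but far too expensive for what it contributes, which is the trade the librarian framing asks you to make.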
Figure 5 — Skill 05
The context library — three layers feeding one attention budget
The questions this skill answers are concrete: What's persistent versus per-run? How does an agent find the right object without drowning in the wrong ones? How do you keep polluting data out of what's searchable? When agents start retrieving the wrong context, how do you trace it? Get this right and every other skill compounds — clean context prevents drift, starves sycophantic confirmation, and makes silent failures rarer.
You can now build agents and diagnose them. The last question is the one that decides whether they belong in production at all: where do you draw the line between what the agent decides and what a human signs off on? "Be careful, be nice" in a system prompt is not an answer — these systems are probabilistic. You need a boundary you can defend.
Four dimensions set it. Blast radius: what's the worst outcome, a typo in a draft or a wrong drug-interaction call? Reversibility: can you undo it? A draft you review before sending, yes; a wire transfer you'd have to claw back, no. Frequency: does it run twice a day or ten thousand times? Verifiability: can you actually confirm it's functionally correct, cheaply? Running underneath all four is the token-economics gate the senior postings demand: is it even worth building an agent for this? Price the task across models before you commit; if a hundred-million-token run can't pay for itself, that's a finding, not a footnote.
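The pricing step is plain arithmetic, and it is worth doing before any build starts. In the sketch below the model names and per-million-token prices are placeholders, not real rates, and the 90/10 split between input and output tokens is an assumption; substitute current figures from whichever providers you are comparing.

```python
def run_cost_usd(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one run, given prices quoted per million tokens."""
    return ((input_tokens / 1_000_000) * usd_per_m_input
            + (output_tokens / 1_000_000) * usd_per_m_output)


# Placeholder (input, output) prices in USD per million tokens. Not real rates.
PRICES = {
    "small_model": (1.0, 5.0),
    "large_model": (15.0, 75.0),
}


def margin_by_model(value_usd: float, input_tokens: int, output_tokens: int) -> dict:
    """Value delivered minus token cost, for each candidate model."""
    return {
        name: value_usd - run_cost_usd(input_tokens, output_tokens, *price)
        for name, price in PRICES.items()
    }


# A hundred-million-token run (assumed 90M in, 10M out) worth 500 USD if it succeeds.
margins = margin_by_model(500.0, 90_000_000, 10_000_000)
```

With these placeholder numbers the run costs 140 USD on the small model and 2,100 USD on the large one, so the same task clears the bar on one and loses money on the other.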
Figure 6 — Framework B
From risk profile to autonomy level
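One way to make the boundary defensible is to write the mapping down as a rule you can argue with. The sketch below is an assumption-heavy illustration, not the mapping from the figure: the four autonomy labels, the 1,000-runs-per-day threshold, and the order of the checks are all invented here to show the idea.

```python
from dataclasses import dataclass


@dataclass
class RiskProfile:
    high_blast_radius: bool   # could the worst case cause serious harm?
    reversible: bool          # can the action be undone after the fact?
    runs_per_day: int         # how often the agent performs this action
    cheaply_verifiable: bool  # can correctness be confirmed at low cost?


def autonomy_level(profile: RiskProfile) -> str:
    """Map a risk profile to a hypothetical autonomy level, strictest rule first."""
    if profile.high_blast_radius and not profile.reversible:
        return "human decides, agent drafts"
    if profile.high_blast_radius or not profile.cheaply_verifiable:
        return "agent acts, human approves each action"
    if profile.runs_per_day > 1000:
        return "agent acts, human audits a sample"
    return "agent acts, human reviews after the fact"


draft_edit = autonomy_level(RiskProfile(False, True, 2, True))
wire_transfer = autonomy_level(RiskProfile(True, False, 50, True))
bulk_tagging = autonomy_level(RiskProfile(False, True, 10_000, True))
```

The value is less in these particular thresholds than in having them written down: when someone disagrees with where the line sits, there is a specific rule to change.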
Read the skills back in order and you'll notice they're not a grab-bag — they're a pipeline. You specify, then you evaluate what came back, then you decompose when one shot won't do (choosing a harness with Framework A), then you recognize the failures that decomposition introduces, then you fix the deepest of them with context architecture — and you decide what the whole thing is allowed to do with Framework B. They're durable precisely because they're bolted to how agents actually work. A model can get ten times better at long-horizon coding and you will still need to state intent unambiguously, still need to know whether the output is functionally right, and still need to feed it clean context.
The market split in Figure 1 isn't going to un-split. The lower arm rewards applying harder; the upper arm rewards being demonstrably good at this specific, learnable thing. The barrier to entry is lower than the last few platform shifts — you don't need to spend a fortune on hardware to practice; you need an agent, a few real tasks, and the discipline to review every output as if your name is on it.
These aren't trends to chase. They're the load-bearing skills of an entire job family being rebuilt around agents — and almost nobody can do them yet.
Pick a skill. Draw the diagram for your own system. Then go build something that doesn't break in production. That's the whole qualification.
The skills-over-tools framing matches what I actually screen for. I have stopped asking candidates which framework they know and started asking how they decide what the agent is not allowed to do. The ones who can answer that have usually shipped something real. The ones who list five frameworks usually have not.
Agreed on the skills point. Curious which two frameworks you would actually put in front of a new hire, though. I keep landing on one graph-based one and one role-based one so they feel both styles, but I go back and forth on it.
OK, so which two frameworks, though? I'm a year in and I keep starting tutorials in a new one every other week and not finishing any of them. If there are really only two that matter for the stack, I would rather just commit and go deep. Sorry if that's in the article and I missed it.
Honestly, stop framework hopping, that's the real trap. Pick one orchestration framework and learn to drive it properly, then learn an eval tool next to it. The specific names matter way less than going deep on one. You can read the others in an afternoon once you understand the underlying moves.
Saving this one, really useful breakdown. Thanks for sharing.
From the hiring side, the skill on this list that is nearly impossible to interview for is judgment about when the output is wrong. Everyone can demo a working happy path. Almost nobody can show me the time they caught the model being confidently wrong and what they did about it. That gap is the whole job.