Chatting with a model isn't a skill. Programming an agent is.

There is a strange contradiction in the AI job market right now. Employers swear they cannot fill agent roles: one widely cited industry survey puts the ratio at roughly 3.2 open roles for every qualified candidate, with senior agent positions sitting empty for nearly 140 days. Meanwhile, thousands of capable people fire off applications into a void and conclude the whole thing is a mirage.

Both groups are telling the truth. The market has split into a K: generic knowledge work flattening into commodity on one arm, agent-native work spiking on the other, and the two arms barely touching.

Figure 1: The shape of the split

Two markets moving in opposite directions

The trap: applying harder on the lower arm feels like effort and produces nothing. The skills below move you onto the upper arm, and they transfer from work you may already do.

Here is the thing nobody applying to the lower arm wants to hear: "I can use ChatGPT" is not the qualification. Talking to a model is table stakes. The upper arm is hiring for something narrower and far more valuable: the ability to program an agent. That means specifying it, testing it, decomposing its work, diagnosing it when it breaks, and feeding it the right context so it holds up in production.

That is a real engineering discipline, and the good news is that it decomposes into a small number of learnable skills. I've pulled five of them, plus two frameworks that turn the skills into decisions, out of the patterns that keep recurring across job postings, Anthropic's own engineering write-ups, and the failure modes everyone hits in production. None of this requires a CS degree. Most of it transfers from work you've already done.

SKILL 01

Specification precision

The spec is the program. Say exactly what you mean.

SKILL 02

Evaluation & judgment

Functional correctness over fluent confidence.

SKILL 03

Decomposition & delegation

Cut work into agent-sized pieces and hand them off.

SKILL 04

Failure-pattern recognition

Name the six ways agents break, then fix them.

SKILL 05

Context architecture

Build a library an agent can actually traverse.

FRAMEWORK A

The Harness Map

Match the task to the architecture. Size it to fit.

…plus FRAMEWORK B: The Trust Boundary, where token economics earns its keep.

The five skills come in the order you actually learn them. You specify something, then you immediately need to know whether you got what you asked for, then you discover the task is too big for one agent, then you watch it fail in interesting ways, then you realize the real bottleneck was the information you fed it. So they share a spine: a loop that every agent program runs whether you drew it or not.

Figure 2: The spine

Specify → build → evaluate, with the gate that matters

Every agent you build runs this loop. The dashed return path is the whole job: when output fails, the fix is almost always upstream, in the spec or the context, not in the model.

The five skills: What you're actually being hired to do

01 Specification precision: the spec is the program

We are trained on humans. Humans read between the lines, infer intent, and quietly fix our sloppy instructions. Agents do almost none of that. An agent takes your specification and runs, and where you left a blank, it fills it in with something plausible that is not what you meant. One of the clearest signals that general intelligence hasn't arrived is exactly this: agents are bad at filling in blanks, so your clarity becomes the program's correctness.

People still call this "prompting." In serious postings it shows up as specification precision or clarity of intent, and the gap between the two phrasings is the gap between a hobbyist and a hire. Watch what happens to a single request when you specify it properly:

✕ Human-grade brief

"Improve customer support. You've read the tickets, come up with a solution."

✓ Agent-grade spec

Build a tier-1 ticket agent. Handle: password resets, order-status, returns. Escalate on negative sentiment (scored per these docs). Log every escalation w/ a reason code.

Same intent. Wildly different programs. The agent-grade spec tells the agent what's in scope, what "escalate" means in measurable terms, and what to record so you can debug it later. That is the 2026 bar for prompting, and if you're a technical writer, a lawyer, or a QA engineer, you have written precisely this kind of unambiguous instruction your whole career. The distance to cross is shorter than it looks.
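To make "the spec is the program" literal, here is a minimal sketch of the agent-grade spec above written as data plus one routing rule. The class name, the sentiment scale, the `-0.5` cutoff, and the reason codes are all hypothetical. Notice too that the spec as written never says what to do with an out-of-scope ticket; escalating it is an assumption this sketch had to make, which is exactly the kind of blank an agent would otherwise fill for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Scope, a measurable escalation rule, and nothing left to inference."""
    role: str
    in_scope: frozenset
    escalate_below: float  # hypothetical sentiment cutoff on a -1..1 scale

    def route(self, intent: str, sentiment: float) -> dict:
        """Decide what happens to one ticket and attach a reason code."""
        if intent not in self.in_scope:
            # Assumption: the original spec is silent on out-of-scope tickets.
            return {"action": "escalate", "reason_code": "OUT_OF_SCOPE"}
        if sentiment < self.escalate_below:
            return {"action": "escalate", "reason_code": "NEGATIVE_SENTIMENT"}
        return {"action": "handle", "reason_code": None}


TIER1 = AgentSpec(
    role="tier-1 ticket agent",
    in_scope=frozenset({"password_reset", "order_status", "returns"}),
    escalate_below=-0.5,
)

escalation_log = []  # "log every escalation with a reason code"


def handle(ticket_id: str, intent: str, sentiment: float) -> dict:
    """Route a ticket and record every escalation so it can be debugged later."""
    decision = TIER1.route(intent, sentiment)
    if decision["action"] == "escalate":
        escalation_log.append({"ticket_id": ticket_id, **decision})
    return decision
```

Calling `handle("T-1", "returns", -0.9)` escalates and leaves a record behind; `handle("T-2", "returns", 0.2)` is handled and logs nothing.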

02 Evaluation & quality judgment: fluency is not correctness

The moment you specify something, you inherit a new problem: did you actually get it? This skill, evaluation and quality judgment, is the single most-cited capability across agent job postings, and it's the substance hiding under all the vague "taste" discourse. Strip away the ego-stroking and "taste" just means error detection with fluency.

It matters because AI fails differently than people do. When humans are wrong, we stumble, we hedge, we trail off, we give the tells you've spent a lifetime learning to read. AI is often confidently, fluently wrong. It produces clean prose, correct-looking headers, a tidy structure, and an answer that's broken underneath. The core discipline is refusing to read fluency as competence.

Semantic correctness is when the model says something that sounds right. Functional correctness is when it says something that is right. Insist on the second.

A model can tell a customer "this is the perfect credit card for you" in flawless English. If it's the wrong card, the sentence is a disaster wearing a suit. The senior skill is holding output to functional correctness and building the machinery that measures it: eval harnesses, automated checks, edge-case probes. Anthropic's own framing of what makes a good eval is the most useful litmus test I've found: a task is well-specified when two independent engineers looking at the same output would agree on pass or fail. If they wouldn't, your eval is taste; if they would, it's a skill anyone can learn.
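A tiny eval harness shows what "two engineers would agree on pass or fail" looks like in practice. This is a sketch under stated assumptions: the customer profiles, the card names, and the "right" answers are invented for illustration, and `fluent_but_wrong` is a stand-in for a model call, not a real one.

```python
from typing import Callable

# Hypothetical cases: each pairs a customer profile with the card that is
# actually right for it. The third case is the edge that fluent answers miss.
CASES = [
    {"profile": {"travels": True, "carries_balance": False}, "expected": "travel_rewards"},
    {"profile": {"travels": False, "carries_balance": True}, "expected": "low_apr"},
    {"profile": {"travels": True, "carries_balance": True}, "expected": "low_apr"},
]


def run_eval(recommend: Callable[[dict], str]) -> dict:
    """Score a recommender on exact matches; no partial credit for sounding right."""
    failures = []
    for case in CASES:
        got = recommend(case["profile"])
        if got != case["expected"]:
            failures.append({**case, "got": got})
    return {"passed": len(CASES) - len(failures), "total": len(CASES), "failures": failures}


def fluent_but_wrong(profile: dict) -> str:
    """Stand-in for a model that ignores whether the customer carries a balance."""
    return "travel_rewards" if profile["travels"] else "low_apr"


report = run_eval(fluent_but_wrong)
print(f"{report['passed']}/{report['total']} passed")  # prints "2/3 passed"
```

The check is a plain equality, so there is nothing to argue about: the stand-in passes the two easy cases and fails the edge case, however good its prose would have been.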

Do this Monday

Start reviewing every AI output as if your name is on it. Not "is this plausible?" but "is this right, including the edges?" Editors and auditors already do this for a living; they're just applying it to a new medium. It's the cheapest way to build the most valuable skill on this list.

03 Decomposition & delegation: a managerial skill, not project management

Eventually the task is too big for one shot, and you reach for multiple agents. People treat this as a chasm: "I can run Claude Code, but multi-agent makes me go white at the roots." It's more approachable than that. At root, working with many agents is the skill of decomposing work and delegating it: breaking a goal into clean, handoff-able chunks. If you've ever split a project into workstreams, the bones transfer.

But do not mistake it for human project management. You can hand six humans a vaguely-scoped assignment and they'll figure it out; people are elastic. Agents are not. They need defined guardrails, explicit goals, and a clear statement of how the system should run. The current best practice is boring and effective: a planner agent that holds the task list and coordinates sub-agents that do the work, the orchestrator-worker pattern. Anthropic reported that this structure, with a lead agent over parallel sub-agents, outperformed a single agent by 90.2% on their internal research evaluation. The catch, which the next framework is built around: that architecture costs roughly fifteen times the tokens of a single chat, so you don't reach for it casually.
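The orchestrator-worker shape is small enough to sketch. Below, `worker` is a stub standing in for a sub-agent call, and the planner simply holds the task list, fans it out, and collects results in order; the function names are hypothetical. The token estimate uses the roughly fifteen-times figure quoted above as its default multiplier.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(subtask: str) -> str:
    """Stub sub-agent: a real one would call a model with its own context."""
    return f"findings for: {subtask}"


def orchestrate(goal: str, subtasks: list[str], max_workers: int = 4) -> dict:
    """Planner: owns the task list, dispatches in parallel, gathers in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, subtasks))  # map preserves input order
    return {"goal": goal, "results": dict(zip(subtasks, results))}


def multi_agent_token_estimate(chat_tokens: int, multiplier: int = 15) -> int:
    """Rough budget: what a single chat would use, times the quoted multiplier."""
    return chat_tokens * multiplier


plan = orchestrate("compare vendors", ["pricing", "security", "support"])
budget = multi_agent_token_estimate(2_000)  # 30000 tokens for a 2,000-token chat
```

The sub-tasks here are explicit strings because the planner can only coordinate what you have defined; a vague list in produces vague work out.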

Which raises the question this skill really turns on: is this task even shaped right for the harness I have? That's not a skill bullet, it's a framework. So let's make it one.

Framework A: The Harness Map

Before you decompose anything, you decide which machine you're decomposing for. A single-threaded agent is an engineer-in-a-box: capable, but you must hand it tasks sized to fit one context window and one line of attention. A planner-plus-sub-agents harness can absorb a larger goal, but only if you've defined the sub-tasks and their relationships clearly enough that the planner can make good calls. Pick the wrong harness for the task and you either starve a powerful system or overload a simple one.

Figure 3: Framework A

Match the task to the harness, then size it to fit

The sizing rule: never decompose in the abstract, decompose for a harness. A single-threaded agent needs steps that each fit one window. A planner needs sub-tasks whose dependencies are explicit. Coding work, note, parallelizes less than research, so reach right less often than you'd think.

Use this as a literal gate. Most tasks belong on the left, and the left is where you should start: it's cheaper, more predictable, and far easier to debug. Cross to the right only when the work is genuinely wide, meaning independent threads that together exceed a single context window, because you pay for that parallelism in tokens whether or not the task needed it.
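Taken literally, the gate is a few lines of code. This is a sketch, not the figure: the `Task` fields, the 200,000-token default window, and the returned labels are assumptions chosen to mirror the sizing rule in the text.

```python
from dataclasses import dataclass


@dataclass
class Task:
    estimated_tokens: int     # rough size of the context the work needs
    independent_threads: int  # pieces that can proceed without each other


def choose_harness(task: Task, context_window: int = 200_000) -> str:
    """Default to single-threaded; orchestrate only when wide AND too big."""
    fits_one_window = task.estimated_tokens <= context_window
    genuinely_wide = task.independent_threads > 1
    if fits_one_window:
        return "single-threaded"
    if not genuinely_wide:
        # Too big for one window but sequential: cut it into window-sized steps.
        return "single-threaded, split into sequential steps"
    return "planner + sub-agents"
```

A task with five independent threads that still fits one window stays on the left: `choose_harness(Task(50_000, 5))` returns `"single-threaded"`, because width alone does not justify the token bill.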

04 Failure-pattern recognition: name the break to fix it

Once you've assembled real systems, you discover they fail in specific, recurring ways. This skill is under-taught and disproportionately valuable, because employers who've tried to build agents know how many ways they can break. There are essentially six, and being able to look at a bad run and say "that's specification drift, not context degradation" is the difference between fixing it in an hour and flailing for a week.

Figure 4: Skill 04, made operational

The failure-mode triage tree

Walk a bad run down this tree by its symptom. Five of the six announce themselves. Silent failure (plausible output that is quietly, functionally wrong) is the one that survives review and reaches your customer, which is exactly why Skill 02 insists on functional checks over eyeballing.

If you're an SRE, a risk manager, or an operations lead, you already think in failure modes; this is a lateral move, not a leap. And if you don't, the work has a puzzle-like pull to it: there is a missing piece in here, and finding it is satisfying. The point is that "the agent is flaky" is not a diagnosis. Naming the failure is where the fix begins.

05 Context architecture: a library agents can traverse

Here's the skill companies will pay almost anything for, because getting it right is what lets them build not one agent but dozens. In 2024 the job was "get the right documents into the prompt." In 2026 it's context architecture: designing the whole information environment an agent draws from, on demand, at scale.

Anthropic frames context as a finite resource with diminishing returns: as the token count climbs, the model's ability to accurately recall information from its own context actually decreases, a phenomenon they call context rot. So the discipline isn't "stuff in everything relevant." It's curating the smallest set of high-signal tokens that maximize the odds of the outcome you want. You're a librarian for machines.

Figure 5: Skill 05

The context library, three layers feeding one attention budget

Think Dewey Decimal for agents: separate what's always there from what's true this run from what's fetched on demand, keep the index clean enough that the agent finds the right "book," and protect the attention budget from clutter. Librarians and technical writers already have the bones of this.

The questions this skill answers are concrete: What's persistent versus per-run? How does an agent find the right object without drowning in the wrong ones? How do you keep polluting data out of what's searchable? When agents start retrieving the wrong context, how do you trace it? Get this right and every other skill compounds: clean context prevents drift, starves sycophantic confirmation, and makes silent failures rarer.
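One way to picture the three layers feeding a single attention budget is a function that assembles context in layer order and refuses to overfill. This is a rough sketch: the four-characters-per-token estimate is a crude stand-in for a real tokenizer, and the relevance scores are assumed to come from whatever retrieval step you already have.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


def build_context(persistent: list[str], per_run: list[str],
                  retrieved: list[tuple[float, str]], budget: int) -> list[str]:
    """Always include the fixed layers, then add fetched items by relevance."""
    context = list(persistent) + list(per_run)
    used = sum(estimate_tokens(item) for item in context)
    if used > budget:
        raise ValueError("fixed layers alone exceed the attention budget")
    # Highest-scoring on-demand items first; skip anything that would overflow.
    for _score, text in sorted(retrieved, key=lambda pair: pair[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            context.append(text)
            used += cost
    return context
```

The design choice worth noticing is the skip: a low-relevance document that does not fit is left out entirely rather than truncated, which keeps half-documents from cluttering what the agent reads.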

Framework B: The Trust Boundary

You can now build agents and diagnose them. The last question is the one that decides whether they belong in production at all: where do you draw the line between what the agent decides and what a human signs off on? "Be careful, be nice" in a system prompt is not an answer; these systems are probabilistic. You need a boundary you can defend.

Four dimensions set it. Blast radius: what's the worst outcome, a typo in a draft or a wrong drug-interaction call? Reversibility: can you undo it? A draft you review before sending, yes; a wire transfer you'd have to claw back, no. Frequency: twice a day or ten thousand times? Verifiability: can you actually confirm it's functionally correct, cheaply? And running underneath all four is the token-economics gate the senior postings demand: is it even worth building an agent for this? Price the task across models before you commit; if a hundred-million-token run can't pay for itself, that's a finding, not a footnote.
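Pricing a task across models is plain arithmetic, and worth writing down before you build anything. In this sketch the per-million-token prices and the model labels are hypothetical placeholders, not real list prices; substitute current numbers for the models you are actually considering.

```python
def run_cost(input_tokens: int, output_tokens: int,
             price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Dollar cost of one run, given prices per million tokens."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000


# Hypothetical prices in dollars per million tokens: (input, output).
MODELS = {"small": (1.0, 5.0), "large": (15.0, 75.0)}


def margin_by_model(input_tokens: int, output_tokens: int,
                    value_per_run: float) -> dict:
    """Value of the run minus its token cost, per model. Negative means no."""
    return {
        name: value_per_run - run_cost(input_tokens, output_tokens, p_in, p_out)
        for name, (p_in, p_out) in MODELS.items()
    }


# A hundred-million-token run (80M in, 20M out) worth $1,000 to the business.
margins = margin_by_model(80_000_000, 20_000_000, value_per_run=1000.0)
# With these placeholder prices: {"small": 820.0, "large": -1700.0}
```

Under these made-up prices the same task pays for itself on the small model and loses money on the large one, which is the finding the paragraph above is asking you to surface.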

Figure 6: Framework B

From risk profile to autonomy level

Autonomy isn't a personality setting, it's the output of a risk calculation. Place each task on the ladder by its profile, ship the worth-it ones, and remember that verifiability is the dimension you control: build the functional check and you've earned a step toward autonomy.
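"The output of a risk calculation" can be sketched as a function from the four dimensions to a rung. Everything specific here is an assumption: the rung names, the 1-to-3 blast-radius scale, the equal weighting of penalties, and the 1,000-runs-per-day threshold are illustrative, not the ladder in the figure.

```python
from dataclasses import dataclass


@dataclass
class RiskProfile:
    blast_radius: int   # hypothetical scale: 1 = typo in a draft, 3 = harm to a person
    reversible: bool    # can the action be undone?
    runs_per_day: int   # frequency
    verifiable: bool    # is there a cheap functional check?


# Hypothetical rungs, least to most autonomous.
LADDER = [
    "human decides",
    "agent drafts, human approves",
    "agent acts, human audits samples",
    "agent acts alone",
]


def autonomy_level(p: RiskProfile) -> str:
    """Start at full autonomy and step down one rung per risk penalty."""
    penalties = p.blast_radius - 1
    penalties += 0 if p.reversible else 1
    penalties += 0 if p.verifiable else 1
    if not p.verifiable and p.runs_per_day > 1000:
        penalties += 1  # unchecked errors multiply at volume
    return LADDER[max(0, len(LADDER) - 1 - penalties)]
```

With these weights, a medium-risk reversible task lands on "agent drafts, human approves" when it has no functional check and moves up to "agent acts, human audits samples" once it does: `autonomy_level(RiskProfile(2, True, 10, False))` versus `autonomy_level(RiskProfile(2, True, 10, True))`. That one-rung difference is the step you earn by building the check.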

Putting it to work: The stack is a sequence, not a menu

Read the skills back in order and you'll notice they're not a grab-bag, they're a pipeline. You specify, then you evaluate what came back, then you decompose when one shot won't do (choosing a harness with Framework A), then you recognize the failures that decomposition introduces, then you fix the deepest of them with context architecture, and you decide what the whole thing is allowed to do with Framework B. They're durable precisely because they're bolted to how agents actually work. A model can get ten times better at long-horizon coding and you will still need to state intent unambiguously, still need to know whether the output is functionally right, and still need to feed it clean context.

A first week of deliberate practice

Mon: Specify. Take one vague task you'd hand a person and rewrite it as an agent-grade spec: scope, definitions, what to log.
Tue: Evaluate. Write one eval that two people would agree on, pass/fail. If they wouldn't, it's not done.
Wed: Map. Take a bigger task and run it through the Harness Map. Default to single-threaded; justify any move to orchestration.
Thu: Triage. Break something on purpose, then name the failure mode from the tree before you fix it.
Fri: Architect & bound. Sort one agent's context into the three layers, then place its riskiest action on the Trust Boundary ladder.

The market split in Figure 1 isn't going to un-split. Applying harder on the lower arm still produces nothing; the upper arm rewards being demonstrably good at this specific, learnable thing. The barrier to entry is lower than in the last few platform shifts: you don't need to spend a fortune on hardware to practice. You need an agent, a few real tasks, and the discipline to review every output as if your name is on it.

These aren't trends to chase. They're the load-bearing skills of an entire job family being rebuilt around agents, and almost nobody can do them yet.

Pick a skill. Draw the diagram for your own system. Then go build something that doesn't break in production. That's the whole qualification.

Comments (6)

Quinn Ueda (Unproven), 2/11/2025

The skills-over-tools framing matches what I actually screen for. I have stopped asking candidates which framework they know and started asking how they decide what the agent is not allowed to do. The ones who can answer that have usually shipped something real. The ones who list five frameworks usually have not.

Tessa Anderson (Awakened), 2/12/2025

Agreed on the skills point. Curious which two frameworks you would actually put in front of a new hire, though. I keep landing on one graph-based one and one role-based one so they feel both styles, but I go back and forth on it.

Beatriz Eze (Awakened), 2/12/2025

Ok so which two frameworks though? I'm a year in and I keep starting tutorials in a new one every other week and not finishing any of them. If there are really only two that matter for the stack I would rather just commit and go deep. Sorry if that's in the article and I missed it.

Ibrahim Rivera (Awakened), 2/12/2025

Honestly, stop framework hopping, that's the real trap. Pick one orchestration framework and learn to drive it properly, then learn an eval tool next to it. The specific names matter way less than going deep on one. You can read the others in an afternoon once you understand the underlying moves.

Wanjiru Okonkwo (Ascendant), 2/13/2025

Saving this one, really useful breakdown. Thanks for sharing.

Olivia Jackson (Awakened), 2/13/2025

From the hiring side, the skill on this list that is nearly impossible to interview for is judgment about when the output is wrong. Everyone can demo a working happy path. Almost nobody can show me the time they caught the model being confidently wrong and what they did about it. That gap is the whole job.