Every frontier jump improves generation faster than most teams improve verification. Enterprise risk compounds silently: more output, same taste perimeter.
Here is the claim, stated so it can be argued with: each frontier model release widens the gap between what a system can do and what an organization can verify it did, and that gap grows faster than the value the model adds. I am a benchmarks person by trade, so I will hold this to the same bar I hold a paper: claim, then evidence, then a source you can open in another tab. Where I only have a trajectory rather than a number, I will say so.

I want to be careful about what kind of essay this is. It is not a security-FUD post. Frontier models are genuinely useful, and they make defenders better too, a point I will steelman before the end. But "useful" and "safe to deploy without instrumentation" are different propositions, and the gap between them is measurable. So let me measure it.
Start with the cleanest before-and-after I have seen. Forescout, testing frontier models against vulnerability-research tasks under its Mythos program, reports that roughly a year ago about 55% of models tested failed at basic vulnerability research. Today, in their testing, every model evaluated completes those tasks, and about half generate a working exploit. That is not a press-release adjective. It is a year-over-year completion rate moving from coin-flip to near-saturation on a task class that used to require a specialist.
Forescout's own framing is the part worth quoting, because it names the mechanism rather than the vibe:
"What Mythos represents: a class of capability that will proliferate, and that fundamentally compresses the timeline between a vulnerability existing and a working exploit being in the hands of adversaries." (Forescout, summarized in the CSA Mythos CISO briefing)
Hold onto the word compresses. The capability curve is not just rising, it is shortening the interval between two events that used to be comfortably far apart. That is the input to everything that follows.
Figure 1 · the two curves
Capability saturates a task; verification does not keep pace
Where the gap lands first
If frontier capability only touched your own stack, this would be a contained problem. It does not. Bitsight puts the same compression mechanism in supply-chain terms: "Frontier AI compresses the time between a vulnerability appearing and its exploitation." Pair that with the structural fact they cite, that about 63% of data breaches trace to third parties, and the exposure is obvious. The faster the exploit window closes, the less an annual vendor questionnaire is worth, because the questionnaire describes a posture that can now change in hours.
This is the part of the argument I find most under-discussed. The conversation about frontier risk tends to fixate on the model you chose. The arithmetic says most of your attack surface is composed of models other organizations chose, wired into services you depend on, assessed on a cadence that predates the capability curve in Figure 1. Continuous monitoring stops being a maturity nicety and becomes the only sampling rate that matches the signal.
Suppose you accept that you need to verify more, faster. The next question is whether current evaluation practice can even localize a failure, and the honest answer from the literature is: not easily. Microsoft Research's AgentRx work states the problem precisely:
"AI agents often fail in ways that are difficult to localize because executions are probabilistic, long-horizon, multi-agent, and mediated by noisy tool outputs." (Microsoft Research, AgentRx)
Read that as a measurement critique, which is what it is. A scalar pass/fail score over a long-horizon, multi-step trajectory throws away exactly the information you need to fix anything: which step went wrong, and why. AgentRx's contribution is trajectory-level diagnosis, attributing a failure to a critical step rather than reporting a number. That is the right granularity, and it is also a tacit admission that the eval tooling most teams run today is too coarse to govern frontier-powered agents. If you cannot localize the failure, you cannot close the gap; you can only observe that it exists.
So capability is compounding and verification is not. Does the value at least outrun the risk? The aggregate numbers say no, or at least not yet. Per a McKinsey survey from November 2025, cited in exploreagentic.ai's AI-agent ROI playbook, roughly 88% of firms report adopting AI while only about 39% report enterprise-level EBIT impact from it. Adoption is nearly saturated. Bottom-line effect is stuck below two in five.
Figure 2 · the value gap, in two bars
Almost everyone adopts; far fewer see it in EBIT
Put Figure 1 and Figure 2 next to each other and the thesis stops being rhetorical. Capability is saturating. Value realization is not. The most parsimonious explanation is that the missing layer sits between them: the ability to verify, trust, and operationalize what the models produce. Better raw capability does not fix that. It widens the thing it cannot reach.
I am contrarian, not fatalistic, so let me argue the other side with the same rigor. Three honest objections.
First, frontier models help defenders too. The same vulnerability-research capability that compresses the attacker's timeline also compresses triage, patch discovery, and detection for the defense. The gap is a function of operational maturity, not a law of physics. An organization that instruments well can run the capability on its own side of the ledger.
Second, security FUD has a cost. Over-indexing on the scary half of the curve stalls legitimate adoption, and stalled adoption is its own EBIT story. The correct response to Figure 2 is not to slow down; it is to invest in the verification layer that turns adoption into impact.
Third, raw frontier is not the only path to good outcomes. This is the result I find most clarifying: a mid-tier model wrapped in a disciplined harness, critic loops, retries, and tight tool contracts, can outperform a more capable model run naked. The AgentFixer and CUGA-style patterns are evidence that the harness is doing real work. Which is the optimistic reading of this entire piece: the gap is closeable with engineering you already know how to do.
The capability numbers (55% to near-100%, half shipping exploits) are Forescout's. The 63% third-party figure is Bitsight's. The 88/39 split is McKinsey via the ROI playbook. The verification curve in Figure 1 is illustrative and labeled as such. If you reuse any of these in a deck, open the footer link and re-read the original framing first; survey stats drift and point-in-time counts age.
If the gap is the problem, the verification stack is the answer, and it is buildable today out of three layers. An eval harness that scores outputs against task-specific checks rather than vibes. A critic loop that reviews and rejects before anything ships, the part that lets a smaller model punch above its weight. And observability over full trajectories, so when something fails you can localize the step in the AgentRx sense, not just log that the run was red. Capability feeds autonomy at the top; without these three layers underneath, autonomy converts directly into unverified risk.
Figure 3 · the layer that converts capability into value
A frontier model needs three layers under it
This is also where the argument connects to a discipline that has nothing to do with security and everything to do with taste. A critic loop is, mechanically, an apparatus for saying no to outputs that do not meet the bar. If you have been thinking about how a team builds the judgment to reject confidently, the piece on scaling your "no" is worth a look on exactly that muscle. And for the layer-one specifics, how to actually stand up an eval harness that scores trajectories rather than vibes, an interesting read for that is the eval-harness blueprint.
The interesting question for 2026 is not "is the model good enough." On most enterprise tasks, it already is. The question is "can you verify what it did fast enough to trust it," and for most organizations the honest answer is still no.
The organizations pulling into the 39% are not, in my reading of the trajectories, the ones with the largest model budget. They are the ones whose verification maturity has kept pace with their capability intake. That is a claim about operational level, not about access. For a clean framing of those maturity rungs, capability versus the engineering discipline wrapped around it, the levels-of-agentic-engineering write-up is worth taking a look at; it maps the ladder better than I can in a paragraph. And if you want a read on where the field's attention is actually heading versus where the keynotes point, the 2026 conference roundup is a useful gut-check.
So adopt the frontier models. They earned the adoption, and the defensive upside is real. Just do not mistake a rising capability curve for a rising value curve. They diverge, the wedge between them is risk, and the only thing that closes it is verification you build on purpose. Better models do not shrink that wedge. They are the reason it is widening.
Better models widen the gap faster than value is a claim I want to believe and am wary of believing, because it is so flattering to the verification crowd. That said the mechanism is sound: generation scales with the model, verification scales with your taste infrastructure, and those are not coupled. I would still like a number on how much the gap widened per model generation before I quote this in a deck.
That is the honest tension and I did not have a clean number either, which is why I framed it as compounding risk rather than a measured rate. The qualitative version I stand behind is that every team I have seen improved output faster than they improved review, and the silent part is that nothing alerts you when that gap grows.
From the enterprise seat this matches what I see. Each model upgrade lets the team produce more, and the verification process is the same manual review it was two upgrades ago. So risk per unit of output is constant or worse while volume climbs. The frontier model is not the danger, the unchanged review process behind it is.
As someone newer this reframed what I should be learning. I kept chasing the newest model and this says the durable skill is the verification side that does not change every release. Less exciting, probably better advice.
Comments (4)
Join the discussion
Sign in to comment, bookmark threads, and continue lessons across sessions.