

The demo worked. The pilot impressed the right people. Now the agent is in production and reality is settling in.
This is the phase most enterprise AI programs are entering right now. Not the "can we build it" phase. The "can we run it" phase. And the gap between those two things is wider than most teams expected.
I've been tracking signals across engineering communities, security research, and enterprise surveys for the past several weeks. Five tensions keep showing up. They're not theoretical. They're operational. And they determine whether an agent deployment scales or stalls.
1. Governance is losing the race against deployment.
A survey of 750 senior technology leaders found that 88% of their organizations experienced confirmed or suspected AI agent security incidents. Only 14.4% said every agent went live with full security and IT approval.
Read those numbers together. The vast majority of enterprises have agents running in production. And in the vast majority of those organizations, at least some agents went live without the controls that every other production system requires.
Singapore published a dedicated governance framework for agentic AI. CIO Magazine has started writing about "agentic constitutions" — central policy documents governing agent behavior and permissions. These concepts are arriving because the old governance models don't account for autonomous action, tool invocation, and multi-agent coordination.
This isn't a checkbox exercise. It's a structural gap.
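To make "agentic constitution" less abstract, here is a minimal sketch of what one central policy entry could look like, written in Python for concreteness. The structure and field names are my assumptions, not a published standard; the point is that agent identity, tool allowlists, data scopes, spend ceilings, and sign-off status live in one auditable place instead of in each team's head.

```python
from dataclasses import dataclass


# Illustrative only: the fields and their names are assumptions,
# not a published "agentic constitution" schema.
@dataclass
class AgentPolicy:
    agent_id: str                          # stable identity for audit trails
    owner: str                             # accountable human or team
    allowed_tools: set[str]                # explicit allowlist, not a denylist
    data_scopes: set[str]                  # systems and datasets the agent may touch
    requires_human_approval: set[str]      # tool calls that must be escalated
    max_spend_usd_per_day: float           # hard cost ceiling
    security_review_id: str | None = None  # sign-off reference; None means not approved

    def permits(self, tool: str, scope: str) -> bool:
        """An action is allowed only if the agent is approved and the
        tool and data scope are both explicitly granted."""
        return (
            self.security_review_id is not None
            and tool in self.allowed_tools
            and scope in self.data_scopes
        )


support_agent = AgentPolicy(
    agent_id="support-triage-v2",
    owner="support-platform-team",
    allowed_tools={"search_tickets", "draft_reply"},
    data_scopes={"zendesk:read"},
    requires_human_approval={"issue_refund"},
    max_spend_usd_per_day=50.0,
    security_review_id="SEC-1234",
)

assert support_agent.permits("search_tickets", "zendesk:read")
assert not support_agent.permits("delete_ticket", "zendesk:read")
```

Enforcement is the hard part, but without even this much written down, "full security and IT approval" has nothing to attach to.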

2. Visibility is not the same as correctness.
89% of organizations have implemented observability for their agents. 52% have evals. That 37-point gap is where bad outputs survive.
Observability tells you what the agent did — the trace, the tool calls, the token counts. Evaluation tells you whether the output was right.
Teams have built the replay infrastructure. What most haven't built is the judgment infrastructure. When 32% of production teams cite quality as their biggest barrier to scaling, the problem isn't that they can't see what's happening. The problem is that no one is systematically deciding if what happened was acceptable.
Does your team have an answer for what happens when an agent returns a confident, well-structured, completely incorrect result?
The hybrid model emerging — LLM-as-judge for scale, human review for stakes — is promising but still improvised. There's no standard eval framework. No agreed-upon metrics beyond accuracy. This is where production engineering needs to go next.
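Here is a minimal sketch of that hybrid loop, assuming a generic `call_llm` client and a hypothetical `flag_for_human_review` queue; the rubric, the score threshold, and the high-stakes tool list are placeholders, not agreed-upon metrics.

```python
import json

# Placeholders: adjust to your own tools and risk tolerance.
HIGH_STAKES_TOOLS = {"issue_refund", "modify_account", "send_external_email"}

JUDGE_PROMPT = """You are grading an AI agent's answer.
Task: {task}
Agent answer: {answer}
Reference evidence: {evidence}
Return JSON: {{"grounded": true or false, "score": 0-5, "reason": "..."}}"""


def evaluate(task: str, answer: str, evidence: str, tools_used: set[str],
             call_llm, flag_for_human_review) -> dict:
    """LLM-as-judge for scale, human review for stakes.
    `call_llm` and `flag_for_human_review` are injected so the sketch
    stays framework-agnostic; it assumes the judge returns valid JSON."""
    verdict = json.loads(call_llm(JUDGE_PROMPT.format(
        task=task, answer=answer, evidence=evidence)))

    needs_human = bool(
        tools_used & HIGH_STAKES_TOOLS         # a high-stakes action was taken
        or not verdict.get("grounded", False)  # judge says it isn't supported by evidence
        or verdict.get("score", 0) < 3         # judge scores it poorly
    )
    if needs_human:
        flag_for_human_review(task=task, answer=answer, verdict=verdict)
    return {"verdict": verdict, "escalated": needs_human}
```

Run the same function offline against a fixed task set before every release and you have the beginnings of a regression suite, which is the judgment infrastructure most teams are missing.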
3. Budgets are built on the wrong cost model.
Enterprise budgets underestimate AI agent total cost of ownership (TCO) by 40–60%. The reason is structural: traditional IT cost models allocate most spend to infrastructure. Agent economics invert that. LLM inference alone can consume 40–60% of total operating expense, and it scales with usage.
A $100K vendor quote becomes $140K–$160K in Year 1 once you include integration, observability, evaluation infrastructure, and the security layer that wasn't in the original plan. Multi-agent workflows multiply inference calls at every handoff. Latency optimization adds engineering headcount. Cost circuit breakers don't exist as standard tooling yet.
The teams that survive this are the ones budgeting for the system around the agent, not just the agent itself. A 1.5x multiplier on initial estimates is the minimum correction for realistic planning.
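A back-of-envelope version of that correction, using the illustrative figures from this section. Every line item and percentage below is an assumption meant to be replaced with your own numbers, not a benchmark.

```python
# Back-of-envelope Year-1 estimate. All numbers are illustrative assumptions.
vendor_quote = 100_000  # the figure in the proposal

hidden_cost_factors = {
    "integration": 0.15,         # wiring the agent into real systems
    "observability": 0.10,       # tracing, logging, replay
    "evaluation": 0.10,          # offline evals and LLM-as-judge runs
    "security_hardening": 0.15,  # the layer that wasn't in the original plan
}

year_one = vendor_quote * (1 + sum(hidden_cost_factors.values()))
print(f"Year-1 estimate: ${year_one:,.0f}")  # $150,000: inside the $140K-$160K band

# Inference spend is usage-driven, so it grows with adoption instead of staying flat.
inference_share = 0.5  # midpoint of the 40-60% range above
print(f"Of which inference, at the midpoint assumption: ${year_one * inference_share:,.0f}")
```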
4. Multi-agent failures are distributed systems problems.
ICLR 2026 research formally classified multi-agent failure modes into four categories: specification ambiguity, organizational breakdown, inter-agent conflict, and weak verification. The most common pattern is error cascading — a small initial mistake snowballing through agents that lose access to the original evidence.
These are not LLM problems. These are the same failure modes that microservices teams have spent a decade learning to handle. State corruption. Silent data propagation. Lack of validation between services.
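The practical carry-over from that decade is the same discipline: validate every handoff between agents and keep the original evidence attached, so a downstream agent can reject a claim it can't trace instead of building on it. A minimal sketch, with a hypothetical handoff envelope:

```python
from dataclasses import dataclass

MAX_HOPS = 3  # arbitrary cap; long chains are where cascades get expensive


# Hypothetical envelope passed between agents; the fields are assumptions,
# not part of any particular agent framework.
@dataclass
class Handoff:
    claim: str           # what the upstream agent concluded
    evidence: list[str]  # the raw sources the claim rests on
    producer: str        # which agent produced it, for audit and blame
    hop_count: int       # how many agents it has already passed through


def accept(handoff: Handoff) -> Handoff:
    """Validate before acting; reject loudly instead of propagating silently,
    which is how a small initial mistake snowballs."""
    if not handoff.evidence:
        raise ValueError(f"{handoff.producer} passed a claim with no evidence attached")
    if handoff.hop_count > MAX_HOPS:
        raise ValueError("too many hops; re-verify against the original sources")
    return handoff


def forward(handoff: Handoff, new_claim: str, producer: str) -> Handoff:
    """Downstream agents add conclusions but never drop the original evidence."""
    return Handoff(claim=new_claim, evidence=handoff.evidence,
                   producer=producer, hop_count=handoff.hop_count + 1)
```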
5. Standards are moving faster than their security models.
MCP (Model Context Protocol) has become the default integration protocol for agent-to-tool communication. Major platforms have adopted it. The community is building on it. A new Apps extension adds UI capabilities for agent clients.
It's also the protocol where Anthropic's own Git server shipped with prompt injection vulnerabilities that enabled file deletion and remote code execution. OWASP data shows prompt injection in 73% of production AI deployments assessed during security audits.
An integration standard needs a trust model. MCP doesn't have a mature one yet. Input validation, permission scoping, and output filtering are left to individual server implementers. Most skip them. Until the standard enforces security at the protocol level, every adoption is a bet that your team will do the hardening that most teams don't.
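Until then, the hardening looks roughly like this on the server side: a gate in front of every tool call that validates input, checks scope, and filters output. This is a generic sketch, not the MCP SDK; the patterns and scope names are assumptions, and keyword matching is a weak baseline against prompt injection, not a solution.

```python
import re

# Generic tool-call gate. Not the MCP SDK; everything here is illustrative.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",  # naive prompt-injection tell
    r"rm\s+-rf",                             # shell destruction smuggled into arguments
]
REQUIRED_SCOPES = {"read_file": {"repo:read"}, "write_file": {"repo:write"}}
SECRET_PATTERN = re.compile(r"(api[_-]?key|password)\s*[:=]\s*\S+", re.IGNORECASE)


def gated_call(tool: str, args: dict, granted_scopes: set[str], run_tool) -> str:
    """Input validation, permission scoping, and output filtering around a
    tool call. `run_tool` is the real implementation, injected by the server."""
    # 1. Input validation: refuse arguments that look like injected instructions.
    flat = " ".join(str(v) for v in args.values()).lower()
    if any(re.search(pattern, flat) for pattern in INJECTION_PATTERNS):
        raise PermissionError(f"suspicious input to {tool}")

    # 2. Permission scoping: fail closed when the tool is unknown or under-scoped.
    if not REQUIRED_SCOPES.get(tool, {"__deny__"}).issubset(granted_scopes):
        raise PermissionError(f"{tool} is not permitted with scopes {sorted(granted_scopes)}")

    # 3. Output filtering: redact obvious secrets before text re-enters the model context.
    return SECRET_PATTERN.sub("[REDACTED]", run_tool(tool, args))
```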
What to mandate now.
If you're running agents in production, these five controls are the governance minimum:
No agent goes live without a documented security review and IT sign-off.
Every agent action is traced to an auditable log tied to a specific agent identity.
Offline evals run before deployment and on a recurring schedule — not just observability.
Cost guardrails with per-agent and per-interaction budgets, enforced at runtime.
A kill switch that halts agent pipelines when behavior deviates from policy (a minimal runtime sketch of these last two controls follows this list).
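Here is that runtime sketch: cost guardrails plus the kill switch. The budget numbers and the halt mechanism are illustrative; as noted above, this doesn't exist as standard tooling yet, which is exactly why it has to be mandated.

```python
import threading


class AgentGuardrails:
    """Per-agent budget enforcement plus a kill switch, checked on every model
    or tool call. All numbers and the halt mechanism are illustrative."""

    def __init__(self, agent_id: str, daily_budget_usd: float,
                 per_interaction_budget_usd: float):
        self.agent_id = agent_id
        self.daily_budget = daily_budget_usd
        self.per_interaction_budget = per_interaction_budget_usd
        self.spent_today = 0.0
        self.halted = threading.Event()  # flipped by an operator or a policy monitor
        self._lock = threading.Lock()

    def kill(self, reason: str) -> None:
        """Halt the pipeline; nothing runs again until a human clears the flag."""
        print(f"[KILL] {self.agent_id}: {reason}")
        self.halted.set()

    def charge(self, interaction_cost_usd: float) -> None:
        """Call before executing a step; raise instead of letting spend run away."""
        if self.halted.is_set():
            raise RuntimeError(f"{self.agent_id} is halted")
        if interaction_cost_usd > self.per_interaction_budget:
            self.kill("single interaction over budget")
            raise RuntimeError("per-interaction budget exceeded")
        with self._lock:
            if self.spent_today + interaction_cost_usd > self.daily_budget:
                self.kill("daily budget exceeded")
                raise RuntimeError("daily budget exceeded")
            self.spent_today += interaction_cost_usd


guard = AgentGuardrails("support-triage-v2", daily_budget_usd=50.0,
                        per_interaction_budget_usd=0.25)
guard.charge(0.04)  # a normal call proceeds; an expensive one trips the switch
```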
None of these are aspirational. They're the equivalent of what we already require for production APIs, database access, and infrastructure deployments. Agents aren't exempt because they're new.
The organizations that treat governance as a prerequisite, not a follow-up, are the ones that will still be running agents a year from now. The rest will have impressive demos and expensive post-mortems.
