Your Agent Stack Is Infrastructure. Runtime, Identity, and State Are Now Platform Decisions

The easiest way to misunderstand agentic AI is to think the hard part is getting the model to do the task. That is demo logic.
In production, the hard part is being able to explain what the agent did, constrain what it was allowed to do, and recover cleanly when it does the wrong thing. Once an agent can hold state across steps, touch real systems, and take actions with real credentials, the system stops being a prompt problem. It becomes a runtime problem.
You can see the market moving there now. OpenAI's trusted-access announcement is really a statement about controlled execution, not just model capability.
The old story was model + prompt + tools. That story is breaking.
What broke
I've seen a version of the same failure more than once.
A team builds an internal agent to triage incoming support issues. It reads the ticket, pulls account context from Salesforce, opens a Jira issue when engineering work is needed, drafts the customer reply, and suggests whether the case should be escalated. In the pilot, it looks great. The agent handles the happy path, the team saves time, and everyone starts talking about broader rollout.
Then a messy case arrives. The agent pulls stale account context, opens the wrong issue, drafts a reply that references the wrong entitlement, and marks the case as lower priority than it should be. Now the team wants to know four things:
- Which credential actually made the call?
- What intermediate state did the agent carry from one step to the next?
- Which tool result caused the bad branch?
- Can we replay the run safely after the fix?
If nobody can answer those questions, that is not model failure. That is runtime failure.
The model may have done exactly what the system allowed it to do. The platform was too loose to make the run observable, governable, and recoverable.
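Making those four questions answerable means the runtime has to record more than "a tool was called." Here is a minimal sketch of a per-step trace record; the field names are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class StepTrace:
    """One step of an agent run, recorded so a bad run can be reconstructed.

    All field names here are illustrative, not a standard schema.
    """
    run_id: str                   # ties the step to a replayable run
    step_id: int                  # ordering within the run
    credential_id: str            # which scoped credential made the call
    tool: str                     # e.g. "salesforce.get_account"
    tool_input: dict[str, Any]    # arguments exactly as the runtime sent them
    tool_output: dict[str, Any]   # structured result, not raw text
    state_before: dict[str, Any]  # carried-forward state entering the step
    state_after: dict[str, Any]   # state the step handed to the next one
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```

With records like this persisted per step, "which credential made the call" and "what state did the agent carry" become queries, not archaeology.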
The five decisions that matter now
Most teams still talk about agent architecture as if the interesting part is model selection. That matters, but once the agent is allowed to do real work, the architecture decisions move somewhere else.
- Identity. Who actually owns credentials and permission scope? If the agent is running with a shared service account and nobody can tie actions back to a workflow, team, or approval boundary, you are going to have governance problems fast.
- State. Where does task state live between steps? If the only "memory" is whatever happened to fit in the current context window, the system will drift, duplicate work, and lose continuity the first time the workflow gets messy.
- Approval. Which actions require a human to confirm before execution? Not every workflow needs a hard gate, but every production workflow needs a clear answer about where autonomous action stops.
- Visibility. What execution trace do you keep? You need more than logs that say a tool was called. You need enough evidence to understand the chain of reasoning, the tool outputs that mattered, and the state transitions that changed the result.
- Recovery. How does a run get replayed, resumed, or rolled back? If the answer is "we just rerun the prompt," the system is not ready.
The agent is no longer just a model with tool calls attached. It is sitting inside a control plane.
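One way to make those five decisions concrete is to force every workflow to declare them before a run starts. A hypothetical run manifest, sketched in Python; the names and options are assumptions, not any particular framework's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunManifest:
    """Declares the control-plane answers before an agent run starts.

    Illustrative only: names and options are assumptions, not a real spec.
    """
    workflow: str                 # e.g. "support-triage"
    credential_scope: str         # identity: a scoped credential, never a shared admin account
    state_store: str              # state: where step state persists between steps
    approval_required: set[str]   # approval: tool names that need a human gate
    trace_sink: str               # visibility: where step traces are written
    replay_mode: str              # recovery: "resume" | "replay" | "rollback"

manifest = RunManifest(
    workflow="support-triage",
    credential_scope="svc-triage-readonly",
    state_store="postgres://agent_runs",
    approval_required={"jira.create_issue", "crm.update_priority"},
    trace_sink="s3://agent-traces/support-triage",
    replay_mode="replay",
)
```

A workflow that cannot fill in this manifest has not actually made the five decisions; it has deferred them to whoever debugs the first incident.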
What production teams ship instead
They give the agent scoped credentials, not broad standing access. They use an explicit state store instead of pretending the context window is a durable system of record. They force tools to return structured outputs so downstream steps do not depend on brittle text parsing. They create a human approval service for high-impact actions instead of burying the checkpoint inside ad hoc UI logic. And they keep an execution trace that makes postmortems possible.
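Structured tool outputs are the cheapest of those wins. Instead of letting a downstream step parse free text, each tool returns a typed result the runtime can validate. A minimal sketch for the support-triage example; the fields and helper are invented for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum

class Entitlement(Enum):
    BASIC = "basic"
    PREMIUM = "premium"
    ENTERPRISE = "enterprise"

@dataclass
class AccountContext:
    """What the CRM lookup tool returns: fields, not prose to parse."""
    account_id: str
    entitlement: Entitlement
    open_incidents: int
    as_of: str  # timezone-aware ISO timestamp, so staleness is checkable

def is_fresh(ctx: AccountContext, max_age: timedelta = timedelta(hours=24)) -> bool:
    """Reject stale context instead of silently branching on it."""
    return datetime.now(timezone.utc) - datetime.fromisoformat(ctx.as_of) <= max_age
```

The `as_of` field is what turns "the agent pulled stale account context" from a postmortem surprise into a precondition the runtime can enforce.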
Production teams also stage autonomy rather than granting it all at once. Phase 1 is usually read, summarize, classify, or prepare. Phase 2 drafts and recommends. Phase 3 acts inside a bounded workflow with clear rollback and audit semantics. That sequence is slower than the "let's just connect the tools and see what happens" approach. It is also how teams avoid the incident pileup that kills trust six weeks later.
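Phasing can be enforced in code rather than by convention. A hedged sketch: gate each tool behind the workflow's current phase, with the tool-to-phase mapping invented for illustration:

```python
from enum import IntEnum

class Phase(IntEnum):
    READ = 1    # read, summarize, classify, prepare
    DRAFT = 2   # draft and recommend; humans still execute
    ACT = 3     # act inside a bounded workflow with rollback

# Minimum phase each tool requires; this mapping is an example, not a standard.
TOOL_PHASE = {
    "crm.get_account": Phase.READ,
    "reply.draft": Phase.DRAFT,
    "jira.create_issue": Phase.ACT,
}

def allowed(tool: str, workflow_phase: Phase) -> bool:
    """An agent in phase 2 can draft a reply but cannot open a Jira issue.
    Unknown tools default to the strictest phase."""
    return workflow_phase >= TOOL_PHASE.get(tool, Phase.ACT)
```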
This is why I don't think the next bottleneck in agent systems is raw capability. The bottleneck is whether the execution layer around the model is solid enough to survive contact with the business.
Build versus buy
A managed runtime is often enough when the workflow is narrow, the tool surface is small, and the approval boundaries are obvious. A lot of internal drafting, summarization, retrieval, and low-risk workflow prep can live there comfortably.
Custom runtime control becomes more important when one of three things is true:
- the workflow touches sensitive systems or regulated data
- the agent needs durable multi-step state across sessions
- the team needs precise replay, audit, or rollback behavior
This is where teams get into trouble. They tell themselves they are still "just prototyping," then quietly add real credentials, more tools, and higher-impact actions. Six weeks later they have a production-shaped system without production-shaped controls.
That is how platform debt shows up in agent programs. It looks like flakiness first, then becomes a debugging and governance problem.
What to measure
If you want to know whether the runtime is good enough, don't ask whether the demo worked. Measure the operating mechanics.
Look at task completion rate, but pair it with human override rate. Look at incident or retry rate when tools fail or state goes stale. Track cost per successful task, not just model spend. And for engineering teams, I would add one metric that matters more than people admit: mean time to explain a bad run.
Mean time to explain is the most telling of these metrics. When a workflow goes sideways, how long does it take to reconstruct what happened? If the answer is hours of combing through logs, Slack messages, and improvised traces, the system is too opaque.
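If incidents are tracked with timestamps, mean time to explain can be measured directly: the clock runs from when an incident opens until someone marks the run as understood. A minimal sketch; the incident schema here is an assumption:

```python
from datetime import datetime
from statistics import mean

def mean_time_to_explain(incidents: list[dict]) -> float:
    """Average hours between an incident opening and the run being explained.

    Each incident dict is assumed to carry ISO timestamps under
    'opened_at' and 'explained_at'; that schema is invented here.
    """
    durations = [
        (datetime.fromisoformat(i["explained_at"])
         - datetime.fromisoformat(i["opened_at"])).total_seconds() / 3600
        for i in incidents
        if i.get("explained_at")
    ]
    return mean(durations) if durations else 0.0

# Example: two bad runs, explained in 2 and 6 hours -> MTTE of 4.0 hours.
print(mean_time_to_explain([
    {"opened_at": "2025-01-10T09:00:00", "explained_at": "2025-01-10T11:00:00"},
    {"opened_at": "2025-01-11T14:00:00", "explained_at": "2025-01-11T20:00:00"},
]))
```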
This is also where review changes. Good review for agent systems is not "does the prompt look smart?" It is:
- can I inspect carried-forward state?
- can I see the action boundary clearly?
- can I tell what needed approval?
- can I replay the run after a fix?
If the answer is no, the system is still a prototype.
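Replay is the hardest of the four to retrofit, which is why it belongs in review from day one. One common shape, sketched under assumptions: re-run the patched decision logic against recorded tool outputs, so a fix can be verified without touching live systems. The trace record shapes and the `decide` signature are assumptions carried over from the earlier sketch:

```python
from typing import Any, Callable

def replay_run(
    steps: list[dict[str, Any]],
    decide: Callable[[dict[str, Any], dict[str, Any]], dict[str, Any]],
) -> list[dict[str, Any]]:
    """Re-execute decision logic over a recorded run and report divergences.

    `steps` are persisted trace records with 'state_before', 'tool_output',
    and 'state_after'; `decide` is the patched agent step function. Both are
    assumed shapes, not a real framework API. No live tool is called, so
    replay is safe.
    """
    divergences = []
    for step in steps:
        new_state = decide(step["state_before"], step["tool_output"])
        if new_state != step["state_after"]:
            divergences.append({"step_id": step["step_id"], "new_state": new_state})
    return divergences
```

A replay that reports exactly where the fixed logic diverges from the recorded run is the evidence "can I replay the run after a fix?" is asking for.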
The new standard
The production question is no longer whether the demo worked. It is simpler and harder: can the platform make the task safe, observable, and repeatable?
That is why runtime, identity, and state are now platform decisions. They determine whether an agent can be governed, debugged, and trusted at scale. If your team cannot explain who held the credentials, where state lived, what required approval, and how a run gets replayed, you do not have an agent platform yet.
You have a demo with admin rights.
