1. What lifecycle management means for AI agents
AI agent lifecycle management is the operational discipline of running an agent from the moment it's proposed through the moment it's safely retired. It is the counterpart to the governance framework: the framework defines the policies and controls; lifecycle management is how you actually exercise them over time.
Without explicit lifecycle management, the agent that passed review on day one is not the agent running on day ninety. The model updates, the tool list creeps, the data scope expands, the policy bundle ages — and one day a regulator or a customer asks a question the runtime can no longer answer.
2. The five phases at a glance
Design
Identity, scope, policies, success criteria, retirement trigger.
Build
Compose tools, write policies as code, test in a dev environment.
Deploy
Promote through environments, canary, baseline.
Operate
Posture, drift detection, incidents, updates, periodic review.
Retire
Revoke credentials, archive logs, document why.
3. Phase 1 — Design
Most lifecycle failures are designed in, which makes the design phase the highest-leverage place to invest time. Six artefacts should be on file before any code is written:
- Identity. A unique credential for the agent, separate from any human user, with a naming convention that survives reorgs.
- Scope statement. Plain-English one-paragraph description of what the agent is for, signed off by the business owner.
- Tool allowlist. The minimal set of MCP tools or actions the agent needs. Default-deny everything else.
- Data scope. Which data sources the agent can read, at what row-level granularity, with which redactions applied.
- Success criteria. The two or three measurable outcomes that mean the agent is doing its job — and the threshold below which it gets pulled.
- Retirement trigger. The condition under which the agent is retired — a date, a metric, a successor product, or a regulatory change.
Note the retirement trigger up front. Agents without retirement triggers tend to live forever, accumulating capabilities the original review never approved.
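The six artefacts can be captured as one version-controlled record per agent. The following is a minimal sketch; the class and field names are illustrative, not a standard schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentDesignRecord:
    """One record per agent, filed before any code is written.

    Field names are illustrative; adapt them to your own schema.
    """
    identity: str                 # unique credential ID, never a human user's
    scope_statement: str          # plain-English purpose, signed off by the owner
    tool_allowlist: frozenset     # minimal tool set; everything else is denied
    data_scope: dict              # source -> {"granularity": ..., "redactions": [...]}
    success_criteria: dict        # metric -> threshold below which the agent is pulled
    retirement_trigger: str       # date, metric, successor product, or regulatory change

    def allows_tool(self, tool: str) -> bool:
        # Default-deny: a tool is callable only if explicitly listed.
        return tool in self.tool_allowlist

# Example record (all values hypothetical).
record = AgentDesignRecord(
    identity="agent-claims-triage-01",
    scope_statement="Routes inbound insurance claims to the right queue.",
    tool_allowlist=frozenset({"claims.read", "queue.assign"}),
    data_scope={"claims_db": {"granularity": "row", "redactions": ["ssn"]}},
    success_criteria={"routing_accuracy": 0.95},
    retirement_trigger="Superseded by claims-platform v2 or accuracy < 0.90",
)
```

Making the record immutable (frozen) mirrors the governance intent: any change to scope or allowlist should arrive as a new, reviewed version, not an in-place edit.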
4. Phase 2 — Build
The build phase is where the design artefacts become runnable. The governance layer has three jobs here:
- Policy as code. The scope statement and tool allowlist are represented as a version-controlled policy bundle. Reviewable, diffable, revertable. No production agent runs against a policy that was edited in the UI without a commit trail.
- Tool brokering rehearsal. Every MCP tool the agent is allowed to call is invoked at least once in a sandbox that exercises the full policy chain — redaction, allowlist enforcement, audit emission.
- Failure rehearsal. The agent is intentionally given a prompt that should be blocked. The block must produce a logged event with the right control-ID tag. If it doesn't, the audit chain is broken before the agent ever ships.
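The failure rehearsal can be expressed as an automated test. This is a sketch assuming a hypothetical `enforce` entry point and an in-memory audit log; both names and the control-ID tag are illustrative.

```python
AUDIT_LOG = []  # stand-in for the real audit sink

def enforce(agent_id: str, tool: str, allowlist: set) -> bool:
    """Hypothetical policy-chain entry point: allow or block, and always log."""
    allowed = tool in allowlist
    AUDIT_LOG.append({
        "agent": agent_id,
        "tool": tool,
        "decision": "allow" if allowed else "block",
        "control_id": "CTL-042",  # illustrative control-ID tag
    })
    return allowed

# Rehearsal: deliberately call a tool outside the allowlist...
blocked = not enforce("agent-claims-triage-01", "email.send", {"claims.read"})

# ...then verify the block produced a logged event with the right tag.
event = AUDIT_LOG[-1]
assert blocked and event["decision"] == "block" and event["control_id"] == "CTL-042"
```

The point of the rehearsal is the final assertion: if a deliberate violation does not leave a correctly tagged event behind, the audit chain is broken before the agent ships.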
5. Phase 3 — Deploy
Deployment in a regulated environment is not a single push. It is a sequenced promotion that produces evidence at every step:
Dev → Staging
Agent runs against synthetic data. Policy enforcement is on. All blocks and allows logged. The log is itself reviewed before promotion.
Staging → Canary
Agent runs against a small slice of real traffic — usually one user group or one BU subset. Posture and incident-rate baselines are taken here.
Canary → Production
Agent runs against full traffic. Drift detection turns on. The baseline from canary becomes the alerting threshold for operate phase.
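The canary-to-production handoff can be sketched as a small function that turns the canary baselines into operate-phase alert thresholds. The headroom policy here (alert at 50% above baseline) is an assumption for illustration, not a recommendation.

```python
def alert_thresholds(canary_baseline: dict, headroom: float = 0.5) -> dict:
    """Derive operate-phase alert thresholds from the canary baseline.

    Hypothetical policy: alert when a metric exceeds its canary baseline
    by more than `headroom` (50% by default).
    """
    return {metric: rate * (1 + headroom) for metric, rate in canary_baseline.items()}

# Baselines taken during the canary stage (illustrative numbers).
baseline = {"blocked_calls_per_1k": 4.0, "incidents_per_week": 0.2}
thresholds = alert_thresholds(baseline)
```

Whatever the policy, the key property is that the threshold is derived from measured canary behaviour, so production alerting starts with evidence rather than a guess.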
6. Phase 4 — Operate
Operating an agent breaks down into roughly five sub-disciplines, each of which turns into an incident if ignored:
- Posture management. A live view of every control the agent runs under — identity, tool allowlist, data scope, redaction rules, policy version. The dashboard answers "is this agent compliant right now?" at a glance.
- Drift detection. When the model version changes, the tool list changes, the data scope changes, or the policy bundle changes, the platform emits an event. Unreviewed drift is the most common cause of enterprise incidents.
- Incident response. When an agent misbehaves, the runbook is: pause the agent, freeze its identity, pull the audit log, write the incident note, decide on remediation. Every step should be a single click in the governance platform.
- Updates. Tool additions, data-scope expansions, and policy edits go through the same review the original agent did. The governance platform should not let an update ship without it.
- Periodic review. Every quarter, run an agent-to-agent audit against the current policy bundle. Confirm the agent is still doing what its scope statement said and still meeting success criteria.
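Drift detection, the second sub-discipline above, reduces to diffing the current posture against a reviewed baseline. A minimal sketch, with hypothetical posture fields matching the dimensions in the text:

```python
def detect_drift(baseline: dict, current: dict) -> list:
    """Return one drift event per posture field that changed since baseline.

    Fields cover the drift dimensions from the text: model version,
    tool list, data scope, and policy bundle version.
    """
    return [
        {"field": key, "was": baseline[key], "now": current[key]}
        for key in baseline
        if baseline[key] != current[key]
    ]

baseline = {
    "model": "m-2024-06",
    "tools": ("claims.read", "queue.assign"),
    "data_scope": ("claims_db",),
    "policy_version": "v14",
}
# The tool list grew without review: the most common drift mode.
current = dict(baseline, tools=("claims.read", "queue.assign", "email.send"))

events = detect_drift(baseline, current)
```

Keeping the history of emitted events, not just the current diff, is what lets an incident review answer "when did this drift start?"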
See the best practices playbook for the concrete metrics to alert on during the operate phase.
7. Phase 5 — Retire
Retirement is the phase teams routinely skip. It is also the one that best protects you from zombie agents: agents no one owns that are still calling tools and reading data. Five steps to retire an agent cleanly:
- Announce retirement — internal owner, business owner, any regulator that was notified of the agent's existence.
- Revoke the agent identity so no further tool call can succeed. The platform should enforce this at the proxy layer, not in application code.
- Archive the audit log against the strictest retention schedule that applies. The log outlives the agent.
- Document the retirement trigger — what condition fired, what evidence was used, who signed it off. This pairs with the retirement-trigger artefact from design phase.
- Confirm zero outbound traffic from the agent identity for at least seven days. Anything that fires after retirement is a finding.
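The final confirmation step can be sketched as a check over the outbound traffic log. The function name and log shape are assumptions for illustration.

```python
from datetime import datetime, timedelta

def retirement_is_clean(traffic_log, agent_id: str, retired_at: datetime,
                        quiet_days: int = 7) -> bool:
    """True if the agent identity produced no outbound traffic in the
    quiet window after retirement; any hit in that window is a finding."""
    window_end = retired_at + timedelta(days=quiet_days)
    return not any(
        entry["agent"] == agent_id and retired_at <= entry["ts"] <= window_end
        for entry in traffic_log
    )

retired = datetime(2025, 1, 1)
log = [{"agent": "agent-claims-triage-01", "ts": datetime(2025, 1, 3)}]

# A call two days after retirement means the retirement was not clean.
clean = retirement_is_clean(log, "agent-claims-triage-01", retired)
```

Running this check at the end of the quiet window, rather than assuming revocation worked, is what turns step 2 (revoke) into verified evidence rather than an intention.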
8. Anti-patterns to avoid
- Treating governance as a launch gate, not a runtime concern. A one-time pre-launch review tells you nothing about the agent on day ninety.
- Shared identities across multiple agents. If you can't tell which agent did something, you can't run lifecycle management on either of them.
- Manual tool-list updates with no review. Tool creep is the most common drift mode. Treat tool additions like privilege escalation requests.
- No retirement trigger. Agents without explicit retirement criteria do not retire on their own. They accumulate capabilities and risk until they fail an audit.
- Posture without history. A dashboard that only shows current state cannot answer "when did this drift start?" — which is the first question every incident review asks.