The Agent Trap: A production readiness checklist (with gating levels)
Published: January 15, 2026
We’ve all been there. You spend an afternoon hacking together an AI agent. You hook up a few tools, write a clever system prompt, and—magic. It correctly retrieves a Jira ticket, summarizes it, and drafts a Slack message. It feels like the future.
Then you ship it.
Two days later, the agent hallucinates a policy violation, loops until it burns $50 in tokens, or crashes because the LLM returned json instead of JSON.
The gap between a Twitter demo and a reliable service is massive. It is filled with edge cases, rate limits, malicious inputs, and confused users. This post proposes a practical path to cross that gap: a maturity model for your agent and a hard checklist to gate your releases.
Redefining "Production Ready"
In traditional software, "production ready" often means the code passes tests and the server is up. For probabilistic software (AI), the definition must shift.
"Production ready" for an agent is not about how smart it is. It's about how boring it is.
- Determinism: Can you bound the chaos?
- Containment: When (not if) it fails, is the blast radius limited?
- Explainability: Six months from now, can you trace why it made that specific decision?
- Economics: Will a spike in traffic bankrupt your API budget?
The Gating Levels: A Maturity Model
Don't try to go from localhost to global scale in one leap. Use these levels to communicate risk and readiness to your stakeholders.
Level 0 — The Prototype (The Lab)
The Goal: Prove the idea isn't terrible.
This is your internal hackathon project. The code is messy, the prompts are brittle, and that's okay.
- Capabilities: It does one specific thing (inputs and outputs are defined).
- Testing: You have a manual list of "weird things I tried that broke it."
- Audience: Internal developers only. No customer data.
Level 1 — Controlled Pilot (The Training Wheels)
The Goal: Reality checks without the risk.
This is where most teams fail. They skip this and go live. Level 1 introduces Shadow Mode or strict Human-in-the-Loop (HITL).
- Shadow Mode: The agent runs in the background on real traffic. It processes data and logs what it would have done, but takes no action. This builds your evaluation dataset without angering users.
- HITL: If it must act, a human reviews the draft. Every single time.
- Safety: You have a strict "allowlist" of tools. No generic "browse the web" capabilities unless absolutely necessary.
Level 2 — Limited Production (The Soft Launch)
The Goal: Reliability as a Service.
Now you have real users, but you have guardrails.
- Reliability: You have defined SLOs (e.g., 95% of answers generated in <5s).
- Resilience: You have "smart retries." If the LLM outputs bad JSON, you don't just crash—you feed the error back to the LLM and ask it to correct itself.
- Evaluation: You have a "Golden Set" of test cases running in CI. If a prompt change drops accuracy by 2%, the build fails.
- Audience: Beta users, non-critical workflows, or a small % of traffic.
Level 3 — Scaled Production (The Platform)
The Goal: Scale without scaling incidents.
This is enterprise-grade.
- Operations: You have on-call runbooks for when the agent goes rogue.
- Defense: You are actively red-teaming your own prompts to find injection vulnerabilities.
- Economics: You handle provider rate limits gracefully (queues, backoff, load shedding).
The Readiness Checklist
If you are moving to Level 2 (Production), these items are not optional.
1. Scope & Success
- [P0] Bounded Job: The agent has a single, one-sentence purpose. Avoid building "god-bots."
- [P0] Contract: Inputs and outputs are strictly typed (using schemas like Pydantic or Zod).
- [P0] The "I Don't Know" Path: The agent is explicitly instructed on how to decline tasks it cannot do, rather than hallucinating an attempt.
2. Tooling & Side Effects
- [P0] No Open Access: Every tool available to the agent is on an allowlist. No
exec()or generic HTTP requests. - [P0] Validation First: The agent’s tool inputs are validated against a schema before the tool executes.
- [P0] Idempotency: If the agent retries a "Create Ticket" step, you don't end up with 5 duplicate tickets.
- [P1] Circuit Breakers: If a tool fails 10 times in a minute, the agent stops trying to call it.
3. Data & Privacy
- [P0] Data Mapping: You know exactly what data enters the context window.
- [P0] Sanitization: PII and Secrets are redacted before they hit the LLM provider's API.
- [P0] Least Privilege: The agent uses API keys with the absolute minimum scope required (e.g., Read-Only access to the database).
4. Safety & Defense
- [P0] Injection Defense: User input is clearly demarcated in the prompt (e.g., wrapped in XML tags) so the model knows what is instruction vs. what is data.
- [P0] Secret Safety: The agent cannot access secrets or environment variables, even indirectly.
- [P1] Escalation Path: When the agent refuses a request due to safety policies, does the user get a helpful error or a silent failure?
5. Evaluation & Versioning
- [P0] The Golden Set: You have a dataset of inputs (happy paths, edge cases, and adversarial attacks) with known-good outputs.
- [P0] CI/CD Integration: You run the Golden Set against every prompt or model change.
- [P0] Versioning: You treat prompts like code. You know exactly which version of a prompt generated a specific output in your logs.
6. Human-in-the-Loop (HITL)
- [P0] Visibility: If a human needs to intervene, the UI shows them the full context—what the user asked, and what the agent tried to do.
- [P0] Override Authority: The human can edit the agent's proposed action or discard it entirely.
- [P1] Feedback Loops: When a human corrects the agent, that data is captured to improve future evaluation sets.
7. Observability
- [P0] Tracing: You have end-to-end traces (using tools like LangSmith, Arize, or Honeycomb) that link the user request -> model thought -> tool call -> result.
- [P1] Cost Monitoring: You track token usage per feature/tenant and have alerts for cost spikes.
- [P1] Quota Management: You handle provider rate limits before they crash your app.
The "Go / No-Go" Decision
Before you hit deploy, look at your team and ask:
- 🟢 GO: We have a Golden Set that passes. We have forced the agent to use schemas. If it fails, we have traces to see why.
- 🟡 PILOT ONLY: It works well, but we don't have automated tests yet. We will only ship this with strict Human-in-the-Loop approval for every action.
- 🔴 NO GO: The agent has write access to the database, but we haven't tested what happens if the LLM hallucinates SQL. (Do not ship.)
Final Thoughts
The most important step in this entire list is Evaluation (Section 5).
Most teams skip building a test harness because it's boring. They prefer tweaking prompts to see if "it feels better." Do not fall into this trap. You cannot engineer reliability if you cannot measure it. Build the harness, then build the agent.