Reliability Engineering · Prepared for Don

Can you trust what your AI agent just did?

A defense-in-depth plan for verifying AI-agent work — triggered by the duplicated rdar at APL — plus a straight answer on how deep to actually invest.

27 Jun 2026 | Status: Plan · awaiting go-ahead | Grounded in NIST · OWASP · Anthropic

The bottom line

Go two layers deep now. Stop before the platform.

Recommendation: Build deterministic dedup (L0) + a selective human gate on irreversible/external actions (L3) first. That is roughly 20% of the enterprise stack for ~80% of the risk reduction on dup-class incidents. Add cheap audit + self-checks next. Defer the independent verifier-agent and the full eval platform unless WSI pays to productize it.

L0 Prevent: Build now
L3 Human gate: Build now
L4 Audit: Next (cheap)
L1 Self-check: Next (cheap)
L2 Verifier: Defer / WSI

The dial is your investment depth. Most teams over-build the expensive layers and under-build the cheap deterministic ones — which is exactly backwards for your kind of incident.

What actually happened

The rdar wasn't wrong. It was un-checked.

An agent created a duplicate radar at APL. The model's reasoning was fine — it just wrote the same side-effect twice because nothing made the write idempotent and nothing read before writing.

The root cause is not correctness; it is the absence of a guardrail. The duplicate side-effect is the bug. This is a classic missing-idempotency failure, one of the most commonly cited root causes of incidents for agents that take real-world actions.
Intent: File a radar
No read-before-write: Didn't check if it already exists
No idempotency key: Retry / re-run created a 2nd record
Duplicate rdar: Visible, embarrassing, manual cleanup

Reframing the problem this way matters: you are not trying to make the agent smarter. You are trying to make its write-capable actions safe, checkable, and reversible. Verification is a property of the harness and tools, not the model.

The mental model

Verification is shared responsibility across 4 surfaces

Anthropic's agent security model splits an agent into four surfaces: model, harness, tools, and environment. The rdar dup lived in Tools/Harness, not the Model, which tells you where to spend.

MODEL

Reasoning

Was the plan sound? Usually yes. Hardest + most expensive to verify; lowest ROI for your incident.

HARNESS

Orchestration

Retries, loops, state. Where a single intent became two writes.

TOOLS

Actions / side-effects

The rdar create. Make these typed, idempotent, read-before-write. Highest leverage.

ENVIRONMENT

Where it runs

Permissions + audit trail. Catch and reconcile after the fact.
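
A minimal sketch of the harness-side fix, assuming the tool accepts an idempotency key: the key is generated once per intent, outside the retry loop, so every retry carries the same key. `run_intent` and the `write_fn(key)` signature are hypothetical names for illustration, not an existing API.

```python
import uuid
from typing import Callable, TypeVar

T = TypeVar("T")


def run_intent(write_fn: Callable[[str], T], max_attempts: int = 3) -> T:
    """Run one intent with retries, reusing a single idempotency key.

    write_fn(key) is a hypothetical tool call that performs the side-effect
    and is expected to treat a repeated key as "already done".
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    # One key per intent. Generating it inside the loop would give each
    # retry a fresh key, which is how one intent becomes two writes.
    key = str(uuid.uuid4())

    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            return write_fn(key)
        except TimeoutError as err:
            # The write may or may not have landed; retry with the SAME key.
            last_error = err
    assert last_error is not None
    raise last_error
```

The key only helps if the tool honors it. The harness guarantees a stable key; the tool (L0 below) guarantees a repeated key is a no-op.
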

The framework

5 layers of defense-in-depth

No single check is sufficient — single-run agent success can be ~60% but collapses toward ~25% across repeated runs. You layer cheap deterministic checks first, expensive probabilistic ones last.

L0 · PREVENTION

Make bad writes impossible

Idempotency keys, a dedupe ledger, read-before-write, unique constraints, compare-and-set. Deterministic, no LLM, no cost per run.

rdar: hash (project+title+type) → if exists, return the existing id instead of creating.
Highest ROI · build now
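
The rdar rule above can be sketched as follows. This is an illustration under stated assumptions, not the production design: `DedupeLedger`, `create_once`, and the normalization (trim + lowercase) are hypothetical, and the in-memory dict stands in for a durable store with a unique constraint on the fingerprint.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable


def fingerprint(project: str, title: str, kind: str) -> str:
    """Stable hash of the fields that define "the same radar".

    Trimming and lowercasing are assumptions; pick the normalization
    that matches how duplicates actually show up.
    """
    parts = [p.strip().lower() for p in (project, title, kind)]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


@dataclass
class DedupeLedger:
    """Hypothetical in-memory ledger mapping fingerprint -> radar id."""

    _ids: dict[str, str] = field(default_factory=dict)

    def create_once(
        self,
        project: str,
        title: str,
        kind: str,
        create_fn: Callable[[str, str, str], str],
    ) -> tuple[str, bool]:
        """Return (radar_id, created). Never calls create_fn twice per fingerprint."""
        key = fingerprint(project, title, kind)
        existing = self._ids.get(key)
        if existing is not None:
            return existing, False  # read-before-write: reuse the existing id
        new_id = create_fn(project, title, kind)
        self._ids[key] = new_id
        return new_id, True
```

A second call with the same project, title, and type returns the first id with `created=False` and never reaches `create_fn`. Concurrent callers would need the check-and-insert to be atomic (for example a unique constraint or compare-and-set), which this single-process sketch does not provide.
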
L1 · SELF-VERIFY

Agent checks its own work

Typed input/output schema validation + read-back: after writing, fetch the record and confirm it matches intent. Self-reflection on tool results.

rdar: after create, query the radar back and assert exactly one match.
Cheap · bake into skills
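
A read-back assertion might look like the sketch below. `search_fn` is a hypothetical lookup that returns the records matching a fingerprint; the record shape (a dict with a `title` field) is also an assumption.

```python
from typing import Callable, Sequence


class ReadBackError(AssertionError):
    """Raised when the write did not land exactly once, or not as intended."""


def assert_exactly_one(
    search_fn: Callable[[str], Sequence[dict]],
    fp: str,
    expected_title: str,
) -> dict:
    """Fetch records for a fingerprint and confirm exactly one matches intent."""
    matches = list(search_fn(fp))
    if len(matches) != 1:
        raise ReadBackError(
            f"expected exactly 1 radar for this fingerprint, found {len(matches)}"
        )
    record = matches[0]
    if record.get("title") != expected_title:
        raise ReadBackError("radar found, but its title does not match the intent")
    return record
```

Zero matches means the write was lost; two or more means the dup class has recurred. Both are worth failing loudly on rather than reporting success.
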
L2 · INDEPENDENT VERIFIER

A second judge

LLM-as-judge / agent-as-judge reviews the action against the goal. Catches semantic mistakes deterministic checks miss — but adds latency, cost, and its own error rate.

rdar: a verifier confirms the radar matches the request before it is considered done.
Lower ROI solo · defer / WSI
L3 · HUMAN GATE

You approve the risky ones

Human-in-the-loop approval for irreversible / external / high-blast actions only. Must be selective — blanket prompts get rubber-stamped.

rdar: creating an external radar pauses for a one-tap confirm with a dedupe preview.
High ROI · build now
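
One way to keep the gate selective is to list the gated actions explicitly and pass the dedupe preview to the approver. In this sketch `GATED_ACTIONS`, `GateRequest`, and `run_with_gate` are hypothetical names, and `approve` stands in for whatever one-tap confirm UI is used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")

# Assumed gate list: only irreversible / external actions pause.
GATED_ACTIONS = frozenset({"create_rdar", "post_slack", "push_code"})


@dataclass(frozen=True)
class GateRequest:
    """What the human sees: the action, a summary, and a dedupe preview."""

    action: str
    summary: str
    possible_duplicates: tuple[str, ...]


def run_with_gate(
    action: str,
    summary: str,
    possible_duplicates: Iterable[str],
    execute: Callable[[], T],
    approve: Callable[[GateRequest], bool],
) -> Optional[T]:
    """Run ungated actions directly; pause gated ones for approval."""
    if action not in GATED_ACTIONS:
        return execute()
    request = GateRequest(action, summary, tuple(possible_duplicates))
    if approve(request):
        return execute()
    return None  # declined: no side-effect happened
```

Because ungated actions never call `approve`, the prompt fires only for the short list, which is what keeps it from being rubber-stamped.
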
L4 · DETECT & AUDIT

Catch what slips through

Durable per-step ledger, nightly reconciliation sweeps, intent-drift / repeated-retry monitoring. Risk often develops across a sequence of individually valid steps.

rdar: a sweep flags any two radars with the same fingerprint within 24h.
Cheap insurance · next
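
The nightly sweep can be as small as the sketch below. It assumes the ledger can be read as `(radar_id, fingerprint, created_at)` tuples; that record shape and the `find_duplicates` name are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Iterable


def find_duplicates(
    records: Iterable[tuple[str, str, datetime]],
    window: timedelta = timedelta(hours=24),
) -> list[tuple[str, str]]:
    """Flag pairs of radars that share a fingerprint and were created within `window`."""
    by_fp: dict[str, list[tuple[datetime, str]]] = defaultdict(list)
    for radar_id, fp, created_at in records:
        by_fp[fp].append((created_at, radar_id))

    flagged: list[tuple[str, str]] = []
    for entries in by_fp.values():
        entries.sort()  # oldest first
        # Comparing neighbours is enough: if any two records fall within
        # the window, some adjacent pair in time order does too.
        for (t_prev, id_prev), (t_next, id_next) in zip(entries, entries[1:]):
            if t_next - t_prev <= window:
                flagged.append((id_prev, id_next))
    return flagged
```

The sweep only reports; deciding which of the pair to close stays a human or L3-gated step.
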
Human-in-the-loop is necessary but not a crutch. In studied deployments ~93% of agent permission prompts were approved without being read. A gate only works if it fires rarely and shows high-signal context (like a dedupe preview).

How much, where

Route verification by blast radius × reversibility

Do not verify everything equally. Match rigor to the cost of being wrong. This table is the operational answer to "how deep."

Action type | Example | Reversible? | Verification depth
Read-only | Search, summarize, query | n/a | None (trust)
Internal reversible write | Edit a Notion page, update a row | Yes | L0 + L1 read-back
External / hard-to-undo | Create rdar, post to Slack, push code | No / costly | L0 + L3 gate + L4 log
Bulk / fan-out | Mass update, multi-record sync | Varies | L0 + L1 + L4 sweep
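
The table translates directly into a lookup. In this sketch the tool names in `TOOL_TYPES` are hypothetical, and routing unknown tools to the strictest row is an assumption (fail closed), not something the table itself states.

```python
from enum import Enum


class ActionType(Enum):
    READ_ONLY = "read_only"
    INTERNAL_REVERSIBLE = "internal_reversible"
    EXTERNAL_HARD_TO_UNDO = "external_hard_to_undo"
    BULK_FAN_OUT = "bulk_fan_out"


# One row per line of the routing table above.
VERIFICATION_DEPTH: dict[ActionType, tuple[str, ...]] = {
    ActionType.READ_ONLY: (),
    ActionType.INTERNAL_REVERSIBLE: ("L0", "L1"),
    ActionType.EXTERNAL_HARD_TO_UNDO: ("L0", "L3", "L4"),
    ActionType.BULK_FAN_OUT: ("L0", "L1", "L4"),
}

# Hypothetical tool names, for illustration only.
TOOL_TYPES: dict[str, ActionType] = {
    "search": ActionType.READ_ONLY,
    "summarize": ActionType.READ_ONLY,
    "edit_notion_page": ActionType.INTERNAL_REVERSIBLE,
    "create_rdar": ActionType.EXTERNAL_HARD_TO_UNDO,
    "post_slack": ActionType.EXTERNAL_HARD_TO_UNDO,
    "mass_update": ActionType.BULK_FAN_OUT,
}


def required_layers(tool_name: str) -> tuple[str, ...]:
    """Return the verification layers a tool call must pass through."""
    # Assumption: an unclassified tool is treated as external / hard-to-undo.
    action_type = TOOL_TYPES.get(tool_name, ActionType.EXTERNAL_HARD_TO_UNDO)
    return VERIFICATION_DEPTH[action_type]
```
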

The investment plan

Three phases — and a clear place to stop

Phase 1 · Now · hours

The 80/20. Stops the dup class outright.

  • Dedupe ledger + idempotency keys on create-type tools
  • Read-before-write on rdar / Slack / external actions
  • Define the L3 gate list (which actions pause)
Effort: low · Risk cut: highest
Phase 2 · This week · ~1 day

Cheap insurance + self-checks.

  • Durable action ledger (every write-step logged)
  • Nightly reconciliation sweep for duplicates
  • L1 read-back assertions baked into skills
Effort: low-med · Risk cut: medium
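
For the action ledger, an append-only log with one JSON object per write-step is probably enough to start. The field names below are assumptions, and `log_write_step` is a hypothetical helper; any writable text stream (an open file in append mode, for instance) can serve as the ledger.

```python
import json
from datetime import datetime, timezone
from typing import TextIO


def log_write_step(
    ledger: TextIO,
    skill: str,
    action: str,
    fingerprint: str,
    result_id: str,
) -> dict:
    """Append one write-step to the ledger as a single JSON line."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "skill": skill,
        "action": action,
        "fingerprint": fingerprint,
        "result_id": result_id,
    }
    ledger.write(json.dumps(entry, sort_keys=True) + "\n")
    ledger.flush()  # push to the OS promptly; true durability also needs fsync
    return entry
```

Logging the fingerprint alongside the result id is what lets the nightly sweep find duplicates without calling the external system.
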
Phase 3 · Optional · WSI-funded

Only if you productize it.

  • Independent verifier (LLM-as-judge) pass
  • Minimum eval harness: tool-selection + context-relevance + faithfulness
  • Dashboards, regression suite
Effort: high · ROI: low for solo use
~80% of dup-class risk removed by Phase 1 alone
~70% of pre-launch failures caught by a minimum 3-metric eval (Phase 3)
93% of permission prompts approved unread: gate selectively
60→25% single-run vs repeated-run success: why you layer
Where to stop: Finish Phase 1 + 2. Treat Phase 3 as a WSI sales artifact, not personal infrastructure. For one operator, an LLM-judge on every action costs more (latency, money, its own mistakes) than it saves.

The reusable contract

Every write-capable skill declares its guarantees

Make verification a checklist each tool must satisfy — so new skills inherit safety by default.

Field | What it answers
Idempotency key | What makes a repeat call a no-op?
Read-back check | How do we confirm the write landed exactly once?
Reversibility class | Reversible / costly / irreversible
Approval required? | Does it trip the L3 human gate?
Audit entry | What gets written to the ledger?
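
The contract could be declared as data and checked when a skill is registered. This is a sketch: `WriteContract` and its field names are hypothetical, and the rule that anything not reversible must require approval is an assumption drawn from the routing table, not a fixed requirement.

```python
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    COSTLY = "costly"
    IRREVERSIBLE = "irreversible"


@dataclass(frozen=True)
class WriteContract:
    """The five guarantees a write-capable skill declares."""

    skill: str
    idempotency_key_fields: tuple[str, ...]  # what makes a repeat call a no-op
    read_back_check: str                     # how we confirm exactly-once
    reversibility: Reversibility
    approval_required: bool                  # does it trip the L3 gate?
    audit_fields: tuple[str, ...]            # what gets written to the ledger

    def problems(self) -> list[str]:
        """Return the gaps in this contract; an empty list means it passes."""
        issues: list[str] = []
        if not self.idempotency_key_fields:
            issues.append("no idempotency key declared")
        if not self.read_back_check.strip():
            issues.append("no read-back check declared")
        # Assumed policy: costly or irreversible writes must pause for approval.
        if self.reversibility is not Reversibility.REVERSIBLE and not self.approval_required:
            issues.append("non-reversible action does not require approval")
        if not self.audit_fields:
            issues.append("no audit entry declared")
        return issues
```

Refusing to register a skill whose `problems()` list is non-empty is one way to make new skills inherit safety by default.
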

For WSI

This maps straight onto the Approval Gates pillar

The same framework — deterministic prevention, selective human gates, audit + reconciliation — is exactly what clients ask for when they say they want to "trust" agents. Phase 3 is the version you sell; Phases 1–2 are the version you run. Building it for yourself first makes it a credible, demoable reference.

Grounding

Industry best practices this draws on

NIST AI RMF — govern/map/measure/manage for AI risk. nist.gov/itl/ai-risk-management-framework
OWASP Agentic Top 10 (2026) — excessive agency, tool misuse, cascading failures. genai.owasp.org
Anthropic shared-responsibility model — model/harness/tools/environment; HITL approval data. backslash.security
Idempotent consumer pattern — dedupe ledger, keys, read-before-write. microservices.io
Agent evaluation — reliability decay, minimum metric set. galileo.ai · towardsdatascience.com
Tool-use validation — typed schemas are non-negotiable. mlflow.org
LLM/agent-as-judge — independent verification. arxiv 2508.02994 · deepeval.com
Risk across valid steps — intent drift, retries, sequencing. zenity.io · ibm.com/think

Definition of done

You will know it works when…

Next step: approve Phase 1 and I will wire the dedupe ledger + read-before-write into the create-type skills, then define the L3 gate list with you.