Reliability Engineering · Prepared for Don

Can you trust what your AI agent just did?

A defense-in-depth plan for verifying AI-agent work — triggered by the duplicated rdar at APL — plus a straight answer on how deep to actually invest.

27 Jun 2026 · Status: Plan, awaiting go-ahead · Grounded in NIST, OWASP, Anthropic

The bottom line

Go two layers deep now. Stop before the platform.

Recommendation: Build deterministic dedupe (L0) + a selective human gate on irreversible/external actions (L3) first. That is roughly 20% of the enterprise stack for ~80% of the risk reduction on dup-class incidents. Add cheap audit + self-checks next. Defer the independent verifier-agent and the full eval platform unless WSI pays to productize it.

Layer	Priority
L0 Prevent	Build now
L3 Human gate	Build now
L4 Audit	Next (cheap)
L1 Self-check	Next (cheap)
L2 Verifier	Defer / WSI

The dial is your investment depth. Most teams over-build the expensive layers and under-build the cheap deterministic ones — which is exactly backwards for your kind of incident.

What actually happened

The rdar wasn't wrong. It was un-checked.

An agent created a duplicate radar at APL. The model's reasoning was fine — it just wrote the same side-effect twice because nothing made the write idempotent and nothing read before writing.

Root cause is not correctness; it is the absence of a guardrail. The duplicate side-effect is the bug. This is a classic missing-idempotency failure, one of the most commonly cited root causes of incidents in agents that take real-world actions.

Intent: file a radar

→

No read-before-write: didn't check if it already exists

→

No idempotency key: retry / re-run created a 2nd record

→

Duplicate rdar: visible, embarrassing, manual cleanup

Reframing the problem this way matters: you are not trying to make the agent smarter. You are trying to make its write-capable actions safe, checkable, and reversible. Verification is a property of the harness and tools, not the model.

The mental model

Verification is shared responsibility across 4 surfaces

Anthropic's agent security model splits an agent into four surfaces: model, harness, tools, and environment. The rdar dup lived in Tools/Harness, not the Model, which tells you where to spend.

MODEL

Reasoning

Was the plan sound? Usually yes. Hardest + most expensive to verify; lowest ROI for your incident.

HARNESS

Orchestration

Retries, loops, state. Where a single intent became two writes.

TOOLS

Actions / side-effects

The rdar create. Make these typed, idempotent, read-before-write. Highest leverage.

ENVIRONMENT

Where it runs

Permissions + audit trail. Catch and reconcile after the fact.
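To make the Harness surface concrete, here is a minimal sketch of how a retry turns one intent into two writes, and how reusing one key across attempts avoids it. `FlakyTracker`, `run_intent`, and the `idempotency_key` parameter are hypothetical names for illustration, not the real radar API; the tracker is an in-memory stand-in that deliberately loses its first response.

```python
import uuid


class FlakyTracker:
    """Hypothetical tool: the first call writes the record, then times out."""

    def __init__(self):
        self.records = {}  # idempotency key -> radar id
        self.calls = 0

    def create(self, title, idempotency_key):
        self.calls += 1
        # The tool treats a repeated key as "already done" and does not write again.
        if idempotency_key not in self.records:
            self.records[idempotency_key] = f"rdar-{len(self.records) + 1}"
        if self.calls == 1:
            raise TimeoutError("response lost after the write landed")
        return self.records[idempotency_key]


def run_intent(tracker, title, max_attempts=3):
    """Retry on timeout, reusing one key so every attempt is the same write."""
    key = str(uuid.uuid4())  # minted once per intent, NOT once per attempt
    for _ in range(max_attempts):
        try:
            return tracker.create(title, idempotency_key=key)
        except TimeoutError:
            continue
    raise RuntimeError("gave up after retries")


tracker = FlakyTracker()
radar_id = run_intent(tracker, "Crash on launch")
print(radar_id, tracker.calls, len(tracker.records))
```

A per-intent key only protects retries inside one run. A fresh re-run of the same task would mint a new key, which is why a content-derived fingerprint at the tool level is still needed.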

The framework

5 layers of defense-in-depth

No single check is sufficient: an agent that succeeds on roughly 60% of single runs can drop toward roughly 25% when success has to hold across repeated runs of the same task. So layer cheap deterministic checks first and expensive probabilistic ones last.

L0 · PREVENTION

Make bad writes impossible

Idempotency keys, a dedupe ledger, read-before-write, unique constraints, compare-and-set. Deterministic, no LLM, no cost per run.

rdar: hash (project+title+type) → if exists, return the existing id instead of creating.

Highest ROI · build now
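The L0 rule for rdar can be sketched roughly as follows. This is an illustration under assumptions, not the real tracker integration: `RadarStore` is an in-memory stand-in for the tracker plus its dedupe ledger, and the normalization rules in `fingerprint` (lowercasing, collapsing whitespace) are a guess at what "the same radar" should mean.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Tuple


def fingerprint(project: str, title: str, kind: str) -> str:
    """Stable idempotency key from the fields that define 'the same radar'."""
    # Normalize case and whitespace so trivial differences do not defeat the dedupe.
    parts = [" ".join(p.lower().split()) for p in (project, title, kind)]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


@dataclass
class RadarStore:
    """Hypothetical in-memory stand-in for the tracker and its dedupe ledger."""

    ledger: Dict[str, str] = field(default_factory=dict)  # fingerprint -> radar id
    next_id: int = 1

    def create_radar(self, project: str, title: str, kind: str) -> Tuple[str, bool]:
        """Return (radar_id, created). A repeat call returns the existing id."""
        key = fingerprint(project, title, kind)
        existing = self.ledger.get(key)  # read-before-write
        if existing is not None:
            return existing, False
        radar_id = f"rdar-{self.next_id}"
        self.next_id += 1
        self.ledger[key] = radar_id
        return radar_id, True


store = RadarStore()
first = store.create_radar("Camera", "Crash on launch", "bug")
retry = store.create_radar("camera", "  Crash on  launch ", "Bug")
print(first, retry)
```

In a real store the check and the insert would need to be atomic, for example via a unique constraint on the fingerprint column, so that two concurrent calls cannot both pass the read.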

L1 · SELF-VERIFY

Agent checks its own work

Typed input/output schema validation + read-back: after writing, fetch the record and confirm it matches intent. Self-reflection on tool results.

rdar: after create, query the radar back and assert exactly one match.

Cheap · bake into skills
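The read-back assertion for rdar might look like the sketch below. The record shape (`id`, `project`, `title`) and the `ReadBackError` name are assumptions for illustration; `after_write` stands in for whatever the tracker returns when queried after the create.

```python
class ReadBackError(Exception):
    """Raised when the post-write state does not match the intent."""


def assert_exactly_one(records, project, title):
    """Return the single matching record, or raise if there are zero or several."""
    matches = [
        r for r in records
        if r["project"] == project and r["title"] == title
    ]
    if len(matches) != 1:
        raise ReadBackError(
            f"expected exactly 1 radar for {project!r} / {title!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


# Stand-in for the result of querying the tracker after the create call.
after_write = [
    {"id": "rdar-1", "project": "Camera", "title": "Crash on launch"},
    {"id": "rdar-2", "project": "Camera", "title": "Crash on launch"},
]

try:
    assert_exactly_one(after_write, "Camera", "Crash on launch")
except ReadBackError as err:
    print(f"read-back failed: {err}")
```

Raising instead of logging means the skill cannot report "done" while a duplicate exists.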

L2 · INDEPENDENT VERIFIER

A second judge

LLM-as-judge / agent-as-judge reviews the action against the goal. Catches semantic mistakes deterministic checks miss — but adds latency, cost, and its own error rate.

rdar: a verifier confirms the radar matches the request before it is considered done.

Lower ROI solo · defer / WSI

L3 · HUMAN GATE

You approve the risky ones

Human-in-the-loop approval for irreversible / external / high-blast actions only. Must be selective — blanket prompts get rubber-stamped.

rdar: creating an external radar pauses for a one-tap confirm with a dedupe preview.

High ROI · build now
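A selective gate with a dedupe preview could be sketched like this. The names are hypothetical: `GATED_TOOLS` is a placeholder for the gate list still to be defined, and `approve` is whatever callback presents the one-tap confirm to the operator.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical gate list: only these tools pause for approval.
GATED_TOOLS = frozenset({"create_radar", "post_to_slack", "push_code"})


@dataclass(frozen=True)
class PendingAction:
    tool: str
    summary: str
    possible_duplicates: Tuple[str, ...] = ()


def render_preview(action: PendingAction) -> str:
    """Build the context shown to the human, including the dedupe preview."""
    lines = [f"{action.tool}: {action.summary}"]
    if action.possible_duplicates:
        lines.append("Possible duplicates: " + ", ".join(action.possible_duplicates))
    else:
        lines.append("No existing match found.")
    return "\n".join(lines)


def run_with_gate(
    action: PendingAction,
    execute: Callable[[], str],
    approve: Callable[[str], bool],
) -> Optional[str]:
    """Run ungated tools directly; gated tools run only if the human approves."""
    if action.tool in GATED_TOOLS and not approve(render_preview(action)):
        return None  # declined: nothing was written
    return execute()


action = PendingAction("create_radar", "Crash on launch", ("rdar-1",))
result = run_with_gate(action, execute=lambda: "rdar-2", approve=lambda preview: False)
print(result)
```

Keeping the gate list short is the point: ungated tools never call `approve`, so the prompts that do fire stay rare enough to be read.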

L4 · DETECT & AUDIT

Catch what slips through

Durable per-step ledger, nightly reconciliation sweeps, intent-drift / repeated-retry monitoring. Risk often develops across a sequence of individually valid steps.

rdar: a sweep flags any two radars with the same fingerprint within 24h.

Cheap insurance · next
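The nightly sweep for rdar could be as small as the sketch below. It assumes each ledger entry carries a `fingerprint`, a `radar_id`, and a `created_at` timestamp; those field names are illustrative, not an existing schema.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_duplicates(entries, window=timedelta(hours=24)):
    """Return (fingerprint, earlier_id, later_id) for same-fingerprint writes within `window`."""
    by_fingerprint = defaultdict(list)
    for entry in entries:
        by_fingerprint[entry["fingerprint"]].append(entry)

    flagged = []
    for fp, group in by_fingerprint.items():
        group.sort(key=lambda e: e["created_at"])
        # After sorting, any pair inside the window implies an adjacent pair inside it.
        for earlier, later in zip(group, group[1:]):
            if later["created_at"] - earlier["created_at"] <= window:
                flagged.append((fp, earlier["radar_id"], later["radar_id"]))
    return flagged


ledger = [
    {"radar_id": "rdar-1", "fingerprint": "abc", "created_at": datetime(2026, 6, 26, 9, 0)},
    {"radar_id": "rdar-2", "fingerprint": "abc", "created_at": datetime(2026, 6, 26, 11, 0)},
    {"radar_id": "rdar-3", "fingerprint": "abc", "created_at": datetime(2026, 6, 29, 9, 0)},
    {"radar_id": "rdar-4", "fingerprint": "xyz", "created_at": datetime(2026, 6, 26, 9, 30)},
]
print(find_duplicates(ledger))
```

Here only `rdar-1` and `rdar-2` are flagged: they share a fingerprint and were created two hours apart, while `rdar-3` falls outside the 24h window.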

Human-in-the-loop is necessary but not a crutch. In studied deployments ~93% of agent permission prompts were approved without being read. A gate only works if it fires rarely and shows high-signal context (like a dedupe preview).

How much, where

Route verification by blast radius × reversibility

Do not verify everything equally. Match rigor to the cost of being wrong. This table is the operational answer to "how deep."

Action type	Example	Reversible?	Verification depth
Read-only	Search, summarize, query	n/a	None (trust)
Internal reversible write	Edit a Notion page, update a row	Yes	L0 + L1 read-back
External / hard-to-undo	Create rdar, post to Slack, push code	No / costly	L0 + L3 gate + L4 log
Bulk / fan-out	Mass update, multi-record sync	Varies	L0 + L1 + L4 sweep
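The table above can be encoded directly as a lookup, so the harness decides the verification depth rather than the model. This is a sketch; the enum and function names are illustrative, and the layer tuples simply restate the table rows.

```python
from enum import Enum


class ActionType(Enum):
    READ_ONLY = "read-only"
    INTERNAL_REVERSIBLE_WRITE = "internal reversible write"
    EXTERNAL_HARD_TO_UNDO = "external / hard-to-undo"
    BULK_FAN_OUT = "bulk / fan-out"


# Verification layers required per action type, taken from the routing table.
VERIFICATION_DEPTH = {
    ActionType.READ_ONLY: (),
    ActionType.INTERNAL_REVERSIBLE_WRITE: ("L0", "L1"),
    ActionType.EXTERNAL_HARD_TO_UNDO: ("L0", "L3", "L4"),
    ActionType.BULK_FAN_OUT: ("L0", "L1", "L4"),
}


def required_layers(action_type: ActionType) -> tuple:
    """Look up which layers must run for this action type."""
    return VERIFICATION_DEPTH[action_type]


print(required_layers(ActionType.EXTERNAL_HARD_TO_UNDO))
```

Using an enum as the key means an unclassified action fails loudly with a `KeyError` or `AttributeError` instead of silently defaulting to "trust".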

The investment plan

Three phases — and a clear place to stop

Now · hours

The 80/20. Stops the dup class outright.

Dedupe ledger + idempotency keys on create-type tools
Read-before-write on rdar / Slack / external actions
Define the L3 gate list (which actions pause)

Effort: low · Risk cut: highest

This week · ~1 day

Cheap insurance + self-checks.

Durable action ledger (every write-step logged)
Nightly reconciliation sweep for duplicates
L1 read-back assertions baked into skills

Effort: low-med · Risk cut: medium

Optional · WSI-funded

Only if you productize it.

Independent verifier (LLM-as-judge) pass
Minimum eval harness: tool-selection + context-relevance + faithfulness
Dashboards, regression suite

Effort: high · ROI: low for solo use

~80% of dup-class risk removed by Phase 1 alone

~70% of pre-launch failures caught by a minimum 3-metric eval (Phase 3)

93% of permission prompts approved unread: gate selectively

60→25% single-run vs repeated-run success: why you layer

Where to stop: Finish Phase 1 + 2. Treat Phase 3 as a WSI sales artifact, not personal infrastructure. For one operator, an LLM-judge on every action costs more (latency, money, its own mistakes) than it saves.

The reusable contract

Every write-capable skill declares its guarantees

Make verification a checklist each tool must satisfy — so new skills inherit safety by default.

Field	What it answers
Idempotency key	What makes a repeat call a no-op?
Read-back check	How do we confirm the write landed exactly once?
Reversibility class	Reversible / costly / irreversible
Approval required?	Does it trip the L3 human gate?
Audit entry	What gets written to the ledger?
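One way to make the contract enforceable is a typed record that every write-capable skill must supply at registration. The sketch below is an assumption about shape, not an existing API: `VerificationContract`, `register_write_skill`, and the example field values for `create_radar` are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Tuple


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    COSTLY = "costly"
    IRREVERSIBLE = "irreversible"


@dataclass(frozen=True)
class VerificationContract:
    idempotency_key: Callable[[dict], str]   # what makes a repeat call a no-op
    read_back_check: Callable[[dict], bool]  # did the write land exactly once
    reversibility: Reversibility
    approval_required: bool                  # does it trip the L3 human gate
    audit_fields: Tuple[str, ...]            # what gets written to the ledger


SKILLS: Dict[str, VerificationContract] = {}


def register_write_skill(name: str, contract: VerificationContract) -> None:
    """Refuse to register a write-capable skill without a complete contract."""
    if not isinstance(contract, VerificationContract):
        raise TypeError(f"{name}: a VerificationContract is required")
    if not contract.audit_fields:
        raise ValueError(f"{name}: declare at least one audit field")
    SKILLS[name] = contract


register_write_skill(
    "create_radar",
    VerificationContract(
        idempotency_key=lambda args: f"{args['project']}|{args['title']}|{args['type']}",
        read_back_check=lambda result: result.get("match_count") == 1,
        reversibility=Reversibility.COSTLY,
        approval_required=True,
        audit_fields=("fingerprint", "radar_id", "created_at"),
    ),
)
print(sorted(SKILLS))
```

Because every field is required by the dataclass, a new skill cannot be registered with a guarantee left blank.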

For WSI

This maps straight onto the Approval Gates pillar

The same framework — deterministic prevention, selective human gates, audit + reconciliation — is exactly what clients ask for when they say they want to "trust" agents. Phase 3 is the version you sell; Phases 1–2 are the version you run. Building it for yourself first makes it a credible, demoable reference.

Grounding

Industry best practices this draws on

NIST AI RMF — govern/map/measure/manage for AI risk. nist.gov/itl/ai-risk-management-framework

OWASP Agentic Top 10 (2026) — excessive agency, tool misuse, cascading failures. genai.owasp.org

Anthropic shared-responsibility model — model/harness/tools/environment; HITL approval data. backslash.security

Idempotent consumer pattern — dedupe ledger, keys, read-before-write. microservices.io

Agent evaluation — reliability decay, minimum metric set. galileo.ai · towardsdatascience.com

Tool-use validation — typed schemas are non-negotiable. mlflow.org

LLM/agent-as-judge — independent verification. arxiv 2508.02994 · deepeval.com

Risk across valid steps — intent drift, retries, sequencing. zenity.io · ibm.com/think

Definition of done

You will know it works when…

Re-running the same agent task creates zero duplicate side-effects.
Every external/irreversible action is either auto-deduped or pauses for a one-tap confirm.
A nightly sweep reports duplicates = 0, with a ledger you can audit.
Adding a new write-capable skill forces filling in the Verification Contract.

Next step: approve Phase 1 and I will wire the dedupe ledger + read-before-write into the create-type skills, then define the L3 gate list with you.