A defense-in-depth plan for verifying AI-agent work — triggered by the duplicated rdar at APL — plus a straight answer on how deep to actually invest.
27 Jun 2026
Status: Plan · awaiting go-ahead
Grounded in NIST · OWASP · Anthropic
The bottom line
Go two layers deep now. Stop before the platform.
Recommendation: Build deterministic dedup (L0) + a selective human gate on irreversible/external actions (L3) first. That is roughly 20% of the enterprise stack for ~80% of the risk reduction on dup-class incidents. Add cheap audit + self-checks next. Defer the independent verifier-agent and the full eval platform unless WSI pays to productize it.
L0 Prevent: Build now
L3 Human gate: Build now
L4 Audit: Next (cheap)
L1 Self-check: Next (cheap)
L2 Verifier: Defer / WSI
The dial is your investment depth. Most teams over-build the expensive layers and under-build the cheap deterministic ones — which is exactly backwards for your kind of incident.
What actually happened
The rdar wasn't wrong. It was unchecked.
An agent created a duplicate radar at APL. The model's reasoning was fine — it just wrote the same side-effect twice because nothing made the write idempotent and nothing read before writing.
Root cause is not correctness; it is the absence of a guardrail. The duplicate side-effect is the bug. This is a classic missing-idempotency failure, a commonly cited root cause of incidents for agents that take real-world actions.
Intent: File a radar → No read-before-write: Didn't check if it already exists → No idempotency key: Retry / re-run created a 2nd record
Reframing the problem this way matters: you are not trying to make the agent smarter. You are trying to make its write-capable actions safe, checkable, and reversible. Verification is a property of the harness and tools, not the model.
The mental model
Verification is shared responsibility across 4 surfaces
Anthropic's agent security model splits an agent into four surfaces. The rdar dup lived in Tools/Harness, not the Model, which tells you where to spend.
MODEL · Reasoning
Was the plan sound? Usually yes. Hardest + most expensive to verify; lowest ROI for your incident.
HARNESS · Orchestration
Retries, loops, state. Where a single intent became two writes.
TOOLS · Actions / side-effects
The rdar create. Make these typed, idempotent, read-before-write. Highest leverage.
ENVIRONMENT · Where it runs
Permissions + audit trail. Catch and reconcile after the fact.
The framework
5 layers of defense-in-depth
No single check is sufficient — single-run agent success can be ~60% but collapses toward ~25% across repeated runs. You layer cheap deterministic checks first, expensive probabilistic ones last.
L0 · PREVENTION
Make bad writes impossible
Idempotency keys, a dedupe ledger, read-before-write, unique constraints, compare-and-set. Deterministic, no LLM, no cost per run.
rdar: hash (project+title+type) → if exists, return the existing id instead of creating.
Highest ROI · build now
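The L0 idea above (hash the identifying fields, return the existing id on a repeat) can be sketched roughly as follows. This is a minimal illustration, not the actual tool: `fingerprint`, `DedupeLedger`, and the `create` callback are hypothetical names, and the in-memory dict stands in for whatever durable store with a unique constraint you would really use.

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint(project: str, title: str, kind: str) -> str:
    """Idempotency key: stable hash of the normalized (project, title, type)."""
    # Assumption: trimming and lowercasing is an acceptable notion of "same radar".
    parts = (part.strip().lower() for part in (project, title, kind))
    # The unit separator keeps ("ab", "c") and ("a", "bc") from colliding.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


@dataclass
class DedupeLedger:
    """In-memory stand-in for a durable ledger keyed by fingerprint."""

    _ids_by_key: dict[str, str] = field(default_factory=dict)

    def create_once(self, project: str, title: str, kind: str, create) -> tuple[str, bool]:
        """Return (radar_id, created). `create` runs only for an unseen key."""
        key = fingerprint(project, title, kind)
        existing = self._ids_by_key.get(key)  # read-before-write
        if existing is not None:
            return existing, False  # repeat call is a no-op
        radar_id = create(project, title, kind)  # hypothetical side-effecting call
        self._ids_by_key[key] = radar_id
        return radar_id, True
```

A retry or re-run that passes the same project, title, and type gets the original id back with `created=False`, so the harness can loop without producing a second record. A real ledger would also need to survive process restarts and concurrent writers, which this sketch does not attempt.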
L1 · SELF-VERIFY
Agent checks its own work
Typed input/output schema validation + read-back: after writing, fetch the record and confirm it matches intent. Self-reflection on tool results.
rdar: after create, query the radar back and assert exactly one match.
Cheap · bake into skills
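One possible shape for the L1 read-back is shown below. It assumes the skill can query radars after the create and hand the results to a checker; `Radar`, `ReadBackError`, and `verify_exactly_one` are illustrative names, not part of any real radar API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Radar:
    radar_id: str
    project: str
    title: str


class ReadBackError(Exception):
    """Raised when the write cannot be confirmed as landing exactly once."""


def verify_exactly_one(radars: list[Radar], project: str, title: str) -> Radar:
    """Check query results fetched after the create against the intent."""
    matches = [r for r in radars if r.project == project and r.title == title]
    if len(matches) != 1:
        # Zero matches: the write did not land. Two or more: a duplicate exists.
        raise ReadBackError(
            f"expected exactly 1 radar for {title!r}, found {len(matches)}"
        )
    return matches[0]
```

Raising instead of returning a boolean is a deliberate choice in this sketch: the skill cannot report success unless the read-back passed.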
L2 · INDEPENDENT VERIFIER
A second judge
LLM-as-judge / agent-as-judge reviews the action against the goal. Catches semantic mistakes deterministic checks miss — but adds latency, cost, and its own error rate.
rdar: a verifier confirms the radar matches the request before it is considered done.
Lower ROI solo · defer / WSI
L3 · HUMAN GATE
You approve the risky ones
Human-in-the-loop approval for irreversible / external / high-blast actions only. Must be selective — blanket prompts get rubber-stamped.
rdar: creating an external radar pauses for a one-tap confirm with a dedupe preview.
High ROI · build now
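A selective gate could be expressed as a small predicate plus a prompt builder, as in the sketch below. The types and the rule "gate anything external or not cleanly reversible" are assumptions drawn from the description above, and every name here is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    COSTLY = "costly"
    IRREVERSIBLE = "irreversible"


@dataclass(frozen=True)
class PendingAction:
    name: str
    external: bool
    reversibility: Reversibility
    possible_duplicates: tuple[str, ...] = ()


def needs_human_gate(action: PendingAction) -> bool:
    """Gate only external or hard-to-undo actions so prompts stay rare."""
    return action.external or action.reversibility is not Reversibility.REVERSIBLE


def confirm_prompt(action: PendingAction) -> str:
    """One-tap confirm text that leads with the dedupe preview."""
    if action.possible_duplicates:
        preview = "Possible duplicates: " + ", ".join(action.possible_duplicates)
    else:
        preview = "No existing match found"
    return f"{action.name}? {preview}. Approve / Reject"
```

Internal reversible writes never reach the prompt, which keeps the gate firing rarely enough that each prompt is worth reading.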
L4 · DETECT & AUDIT
Catch what slips through
Durable per-step ledger, nightly reconciliation sweeps, intent-drift / repeated-retry monitoring. Risk often develops across a sequence of individually valid steps.
rdar: a sweep flags any two radars with the same fingerprint within 24h.
Cheap insurance · next
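The nightly sweep might look something like this. It assumes each ledger entry carries a radar id, a fingerprint, and a creation timestamp; the tuple layout and the function name are illustrative only.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_duplicates(entries, window=timedelta(hours=24)):
    """Flag pairs of radars sharing a fingerprint and created within `window`.

    `entries` is an iterable of (radar_id, fingerprint, created_at) tuples.
    """
    by_fingerprint = defaultdict(list)
    for radar_id, fp, created_at in entries:
        by_fingerprint[fp].append((created_at, radar_id))

    flagged = []
    for records in by_fingerprint.values():
        records.sort()  # oldest first
        # Comparing neighbours is enough: if any two records fall inside the
        # window, some adjacent pair between them does too.
        for (t_prev, id_prev), (t_next, id_next) in zip(records, records[1:]):
            if t_next - t_prev <= window:
                flagged.append((id_prev, id_next))
    return flagged


ledger_entries = [
    ("r1", "fp-a", datetime(2026, 6, 27, 9, 0)),
    ("r2", "fp-a", datetime(2026, 6, 27, 15, 0)),
    ("r3", "fp-a", datetime(2026, 6, 29, 9, 0)),
    ("r4", "fp-b", datetime(2026, 6, 27, 9, 5)),
]
suspects = find_duplicates(ledger_entries)
```

Here `r1` and `r2` share a fingerprint six hours apart and are flagged; `r3` is more than 24 hours after `r2`, and `r4` has a different fingerprint, so neither is reported.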
Human-in-the-loop is necessary but not a crutch. In studied deployments ~93% of agent permission prompts were approved without being read. A gate only works if it fires rarely and shows high-signal context (like a dedupe preview).
How much, where
Route verification by blast radius × reversibility
Do not verify everything equally. Match rigor to the cost of being wrong. This table is the operational answer to "how deep."
Action type | Example | Reversible? | Verification depth
Read-only | Search, summarize, query | n/a | None (trust)
Internal reversible write | Edit a Notion page, update a row | Yes | L0 + L1 read-back
External / hard-to-undo | Create rdar, post to Slack, push code | No / costly | L0 + L3 gate + L4 log
Bulk / fan-out | Mass update, multi-record sync | Varies | L0 + L1 + L4 sweep
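The routing table above is small enough to encode directly as data, so the harness can look up which layers apply before it runs a tool. The enum and mapping names below are hypothetical; only the action types and layer assignments come from the table.

```python
from enum import Enum


class ActionType(Enum):
    READ_ONLY = "read-only"
    INTERNAL_REVERSIBLE_WRITE = "internal reversible write"
    EXTERNAL_HARD_TO_UNDO = "external / hard-to-undo"
    BULK_FAN_OUT = "bulk / fan-out"


# Verification depth per action type, mirroring the table row by row.
VERIFICATION_DEPTH: dict[ActionType, tuple[str, ...]] = {
    ActionType.READ_ONLY: (),  # none (trust)
    ActionType.INTERNAL_REVERSIBLE_WRITE: ("L0", "L1"),
    ActionType.EXTERNAL_HARD_TO_UNDO: ("L0", "L3", "L4"),
    ActionType.BULK_FAN_OUT: ("L0", "L1", "L4"),
}


def required_layers(action_type: ActionType) -> tuple[str, ...]:
    """Return the verification layers a tool of this type must pass through."""
    return VERIFICATION_DEPTH[action_type]
```

Keeping the mapping as a plain dict means a new action type fails loudly with a `KeyError` until someone decides its verification depth, rather than silently running unverified.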
The investment plan
Three phases — and a clear place to stop
Phase 1 · Now · hours
The 80/20. Stops the dup class outright.
Dedupe ledger + idempotency keys on create-type tools
Read-before-write on rdar / Slack / external actions
of pre-launch failures caught by a minimum 3-metric eval (Phase 3)
93% of permission prompts approved unread (gate selectively)
60→25% single-run vs repeated-run success (why you layer)
Where to stop: Finish Phase 1 + 2. Treat Phase 3 as a WSI sales artifact, not personal infrastructure. For one operator, an LLM-judge on every action costs more (latency, money, its own mistakes) than it saves.
The reusable contract
Every write-capable skill declares its guarantees
Make verification a checklist each tool must satisfy — so new skills inherit safety by default.
Field | What it answers
Idempotency key | What makes a repeat call a no-op?
Read-back check | How do we confirm the write landed exactly once?
Reversibility class | Reversible / costly / irreversible
Approval required? | Does it trip the L3 human gate?
Audit entry | What gets written to the ledger?
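One way to make the contract enforceable is to have each skill declare it as a typed record and run a check at registration time. The sketch below is illustrative: `SkillContract` and `contract_problems` are hypothetical names, and the rule that non-reversible skills must require approval is an assumption inferred from the routing guidance, not something the contract itself states.

```python
from dataclasses import dataclass, fields

REVERSIBILITY_CLASSES = {"reversible", "costly", "irreversible"}


@dataclass(frozen=True)
class SkillContract:
    idempotency_key: str      # what makes a repeat call a no-op
    read_back_check: str      # how we confirm the write landed exactly once
    reversibility_class: str  # "reversible", "costly", or "irreversible"
    approval_required: bool   # whether it trips the L3 human gate
    audit_entry: str          # what gets written to the ledger


def contract_problems(contract: SkillContract) -> list[str]:
    """Return human-readable problems; an empty list means the contract passes."""
    problems = []
    for f in fields(contract):
        value = getattr(contract, f.name)
        if isinstance(value, str) and not value.strip():
            problems.append(f"{f.name} is empty")
    if contract.reversibility_class not in REVERSIBILITY_CLASSES:
        problems.append(f"unknown reversibility class: {contract.reversibility_class!r}")
    # Assumed policy: anything that is not cleanly reversible goes through the gate.
    if contract.reversibility_class != "reversible" and not contract.approval_required:
        problems.append("non-reversible skill must require approval")
    return problems
```

A skill registry could refuse to load any write-capable skill whose contract returns a non-empty list, which is how new skills would inherit the checklist by default.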
For WSI
This maps straight onto the Approval Gates pillar
The same framework — deterministic prevention, selective human gates, audit + reconciliation — is exactly what clients ask for when they say they want to "trust" agents. Phase 3 is the version you sell; Phases 1–2 are the version you run. Building it for yourself first makes it a credible, demoable reference.
Grounding
Industry best practices this draws on
NIST AI RMF — govern/map/measure/manage for AI risk. nist.gov/itl/ai-risk-management-framework