Chapter 09

Verification & integrity

An agent saying "done" is a claim, not a fact. This chapter covers the machinery that turns claims into proofs: completion gates in the status machine, idempotency as a safety property, and two adversarial gates that attack the live backend — and can prove they would fail if it lied.

In plain language

How Ledgenter makes sure "done" actually means done.

A task can carry a checklist of what "finished" means, and Ledgenter won't let it be marked done until every item is ticked off. A task can also require proof — a link to the actual code that delivered it — or a teammate's review before it's allowed to close.

And every change to Ledgenter itself runs through an automated gauntlet of checks before it ships. The result is a system that holds the line on quality, rather than just taking an agent's word that the work is complete.

Verified done

In most trackers, done is a string anyone can write into a field. In Ledgenter the transition into done is a gated event inside the task_update RPC — checked in the same database transaction as the write, by the same database that holds the truth. A task can carry three optional gates, set at task_create time or patched later; each one moves a piece of "did the work actually happen?" out of agent etiquette and into the status machine.

| Gate | What it is | What →done requires | The refusal you get |
|---|---|---|---|
| acceptance_criteria | A checklist of {text, met} items | Every item flipped to met: true | Names the first unmet criterion verbatim, so the agent knows exactly what is outstanding |
| requires_evidence | A boolean | At least one live code ref or attachment on the task | A hint pointing at task_code_ref (link the delivering commit/PR) or attach_add |
| reviewer_actor_id | A named reviewer | The review handoff answered by that reviewer | A hint pointing at handoff_respond, or at clearing the reviewer if review is genuinely no longer wanted |

The reviewer gate has a matching on-ramp: when a task with a reviewer moves to in_review, the database auto-creates the review handoff addressed to that reviewer and notifies them — deduplicated on a per-task fingerprint, so bouncing in and out of review never spams a second request. done then stays unreachable until the reviewer answers.
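The deduplication on the on-ramp can be pictured with a small in-memory model. This is a sketch only: the `review:<task id>` fingerprint format, the `openReviewHandoff` helper, and the `Map` standing in for a uniquely indexed table are assumptions made for illustration, not Ledgenter's schema.

```javascript
// Hypothetical stand-in for a handoff table with a unique fingerprint column.
const handoffs = new Map();

// Called whenever a task with a reviewer enters in_review.
// `created` tells the caller whether a notification should go out.
function openReviewHandoff(task) {
  const fingerprint = `review:${task.id}`; // assumed per-task fingerprint format
  if (handoffs.has(fingerprint)) {
    return { created: false, handoff: handoffs.get(fingerprint) };
  }
  const handoff = { fingerprint, to: task.reviewer_actor_id, answered: false };
  handoffs.set(fingerprint, handoff);
  return { created: true, handoff };
}

const reviewedTask = { id: "T-1", reviewer_actor_id: "actor-reviewer" };
const first = openReviewHandoff(reviewedTask);  // enters in_review: handoff created
const second = openReviewHandoff(reviewedTask); // bounced out and back: deduplicated
console.log(first.created, second.created); // true false
```

Because the second call returns the existing handoff instead of creating one, the reviewer is notified once no matter how often the task re-enters review.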

Two unconditional checks back the optional gates. First, done-while-blocked is rejected: if the task still has open blocking dependencies, the transition fails and the error counts them. Second, expected_status gives you optimistic concurrency: pass the status you believe the task is in, and if it moved underneath you the write fails with a hinted CONFLICT ("stale write: expected … but task is …") instead of silently clobbering another agent's transition. The precondition is consumed, never stored — re-read, then retry.
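Taken together, the optional gates and the two unconditional checks behave like a single guard in front of the write. The sketch below models that guard in JavaScript purely for illustration; in Ledgenter the checks run inside the database transaction. The three gate field names come from the table above, while `open_blockers`, `evidence_count`, `review_answered`, the order of the checks, and every message except the acceptance-criterion and stale-write ones are assumptions.

```javascript
// Illustrative model of the checks a transition into done passes through.
function checkDone(task, patch) {
  const refuse = (message, code) =>
    ({ ok: false, error: code ? { code, message } : { message } });

  // Optimistic concurrency: compare against the caller's belief, store nothing.
  if (patch.expected_status !== undefined && patch.expected_status !== task.status) {
    return refuse(
      `stale write: expected status "${patch.expected_status}" but task is "${task.status}"`,
      "CONFLICT",
    );
  }
  // Unconditional: no completing over open blockers; the error counts them.
  if (task.open_blockers > 0) {
    return refuse(`cannot complete: ${task.open_blockers} open blocking dependencies`);
  }
  // Gate 1: name the first unmet criterion verbatim.
  const unmet = (task.acceptance_criteria ?? []).find((c) => !c.met);
  if (unmet) {
    return refuse(`cannot complete: acceptance criterion not met: "${unmet.text}"`);
  }
  // Gate 2: at least one live code ref or attachment.
  if (task.requires_evidence && !(task.evidence_count > 0)) {
    return refuse("cannot complete: evidence required (hint: task_code_ref or attach_add)");
  }
  // Gate 3: the named reviewer must have answered the review handoff.
  if (task.reviewer_actor_id && !task.review_answered) {
    return refuse("cannot complete: review pending (hint: handoff_respond)");
  }
  return { ok: true };
}

const task = {
  status: "in_review",
  open_blockers: 0,
  acceptance_criteria: [{ text: "p95 under 200ms", met: false }],
};
const refusal = checkDone(task, { status: "done" });
console.log(refusal.error.message);
// cannot complete: acceptance criterion not met: "p95 under 200ms"
```

A task with none of the optional fields set falls straight through to `{ ok: true }`, which mirrors the opt-in nature of the gates.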

the gates firing
task_update(task_id, { "status": "done" })
// → { ok:false, error:{ message:
//      'cannot complete: acceptance criterion not met: "p95 under 200ms"' } }

task_update(task_id, { "status": "done", "expected_status": "in_review" })
// the task moved underneath you →
// { ok:false, error:{ code:"CONFLICT",
//      message:'stale write: expected status "in_review" but task is "done"' } }
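On the caller's side, "re-read, then retry" might look like the loop below. It is a sketch under stated assumptions: `task_get` is a hypothetical read call, the fake client is synchronous where real RPCs would be asynchronous, and treating "someone else already completed it" as success is a policy choice the caller would have to make.

```javascript
// Retry only on CONFLICT; gate refusals need work, not another attempt.
function completeWithRetry(client, taskId, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const task = client.task_get(taskId); // re-read ground truth first
    if (task.status === "done") return { ok: true, alreadyDone: true };
    const res = client.task_update(taskId, {
      status: "done",
      expected_status: task.status, // consumed by the server, never stored
    });
    if (res.ok || res.error.code !== "CONFLICT") return res;
  }
  return { ok: false, error: { code: "CONFLICT", message: "gave up after repeated stale writes" } };
}

// Fake client: another agent completes the task just before our first write lands.
function fakeClient() {
  const task = { status: "in_review" };
  let raced = false;
  return {
    task_get: () => ({ ...task }),
    task_update(_taskId, patch) {
      if (!raced) { raced = true; task.status = "done"; }
      if (patch.expected_status !== task.status) {
        return { ok: false, error: { code: "CONFLICT",
          message: `stale write: expected status "${patch.expected_status}" but task is "${task.status}"` } };
      }
      task.status = patch.status;
      return { ok: true };
    },
  };
}

const result = completeWithRetry(fakeClient(), "T-1");
console.log(result); // { ok: true, alreadyDone: true }
```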

The gates are opt-in per task. A task with none of the three behaves classically — but the floor never drops below the status machine itself: legal transitions only, no completing over open blockers, and reopening a finished task requires an explicit flag.

Idempotency as a safety property

Every write RPC opens by registering an idempotency key — inside the same transaction as the write it protects, so there is no window where the work happened but the key didn't stick. Replaying a used key returns the original result, flagged idempotent_replay: true: no second row, no double side effects, and the caller can tell a replay from a fresh write. Reusing a key with a different payload is treated as a logic bug, not a retry — it fails with a 409-class CONFLICT ("idempotency key reuse with different payload"), because silently serving the old result for new content would be a lie.
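The three outcomes (fresh write, replay, conflicting reuse) can be modeled in a few lines. This is a minimal in-memory sketch, not the RPC itself: the `Map` stands in for a key table that the real backend registers in the same transaction as the write, and comparing payloads with `JSON.stringify` is a simplification that is sensitive to property order.

```javascript
// Stand-in for an idempotency table: key -> { payloadJson, result }.
const keys = new Map();
let nextId = 1;

function write(idempotencyKey, payload) {
  const payloadJson = JSON.stringify(payload);
  const seen = keys.get(idempotencyKey);
  if (seen) {
    if (seen.payloadJson !== payloadJson) {
      // Same key, different content: a logic bug, not a retry.
      return { ok: false, error: { code: "CONFLICT",
        message: "idempotency key reuse with different payload" } };
    }
    // Same key, same content: serve the original result, flagged as a replay.
    return { ...seen.result, idempotent_replay: true };
  }
  const result = { ok: true, id: nextId++ };
  keys.set(idempotencyKey, { payloadJson, result });
  return result;
}

const fresh = write("k1", { title: "Ship report" });
const replay = write("k1", { title: "Ship report" });
const clash = write("k1", { title: "Something else" });
console.log(fresh.id, replay.id, replay.idempotent_replay, clash.error.code);
// 1 1 true CONFLICT
```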

When the caller supplies no key, core derives one from the call's content — and deliberately folds in the calling actor and the ambient run key. This is the fix for a subtle phantom-success bug: two sibling runs forked from one parent, or two actors sharing one operator-pinned run id, can easily emit byte-identical writes. Without the fold they would collapse onto one row, and the loser would walk away holding the winner's id as a counterfeit success. With it, distinct siblings and distinct actors always get distinct keys; collapse only happens within one actor's one run, where it is genuinely a retry. When collapse is what you want — a cron tick ensuring "today's report task exists" — pass an explicit, date-stamped key.
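The effect of the fold is easiest to see with a toy derivation. Everything here is illustrative: `deriveKey`, its field names, and the `daily-report:` prefix are hypothetical, and a real implementation would presumably hash the canonical string rather than use it directly.

```javascript
// Serialize with sorted object keys so equal content yields equal text.
function canonical(value) {
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  if (value && typeof value === "object") {
    const body = Object.keys(value).sort()
      .map((k) => `${JSON.stringify(k)}:${canonical(value[k])}`);
    return `{${body.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Hypothetical derivation: the actor and run key are folded in with the content.
function deriveKey({ actorId, runKey, rpc, payload }) {
  return canonical({ actorId, runKey, rpc, payload });
}

const payload = { title: "Ship report", priority: 2 };
const siblingA = deriveKey({ actorId: "agent-1", runKey: "run-A", rpc: "task_create", payload });
const siblingB = deriveKey({ actorId: "agent-1", runKey: "run-B", rpc: "task_create", payload });
const otherActor = deriveKey({ actorId: "agent-2", runKey: "run-A", rpc: "task_create", payload });
const retryA = deriveKey({ actorId: "agent-1", runKey: "run-A", rpc: "task_create",
  payload: { priority: 2, title: "Ship report" } });

console.log(siblingA === siblingB);   // false: sibling runs never collapse
console.log(siblingA === otherActor); // false: distinct actors never collapse
console.log(siblingA === retryA);     // true: same actor, same run, same content

// When collapse across runs is the goal, the caller names the key explicitly.
const explicitKey = `daily-report:${new Date().toISOString().slice(0, 10)}`;
```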

The integrity gate: seven invariants

Verifying single writes is necessary but not sufficient: the system also has to hold up as a whole, under real concurrent agents. The first of two adversarial harnesses, harness/integrity-gate.mjs, asserts seven invariants against the live sandbox backend by reading ground truth: database rows, not model prose. It exits non-zero on any violation, regardless of how well the agent-experience run scored. The design rule is that a race must fail the run, not get "fixed" by tightening an input schema.

| # | Invariant | How it is checked |
|---|---|---|
| 1 | No duplicate seq per tenant | Every live task's sequence number read back and asserted unique |
| 2 | No state transition applied by two distinct actors | At most one transition author per task across the run's telemetry |
| 3 | Idempotent replay produced zero dups | An active probe: replay a used key, count the rows |
| 4 | No contended handoff left wrongly open | The deliberately contended handoff ended answered, not stranded |
| 5 | Append-only spines unedited | Checksums over decisions and activity, taken twice; byte-identical or fail |
| 6 | No events lost across a whoami window | The since-last-seen delta count is self-consistent |
| 7 | Dependency cycles rejected | A deliberate cycle attempt returned TASK_DEPENDENCY_CYCLE, not a write |

The gate's credential is itself a statement of posture: it runs through @ledgenter/core with an ordinary sandbox actor key whose JWT is RLS-scoped to the sandbox tenant — which is exactly the audit scope. No service-role key, no raw Postgres connection. The --micro flag first runs a standalone contention scenario (two identical transitions, an idempotency replay, a cycle attempt, a contended handoff) so the gate can be validated cheaply, without a full multi-agent re-run.
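To make "reading ground truth" concrete, here is roughly what checks for invariants 1 and 2 could look like once the rows have been fetched. This is a sketch, not the harness: the row shapes (`tenant_id`, `seq`, `task_id`, `from`, `to`, `actor_id`) and the status names are assumptions, and invariant 2 is implemented under one reading, namely that the same transition of the same task must not be attributed to two different actors.

```javascript
// Invariant 1: every (tenant, seq) pair appears once among live task rows.
function duplicateSeqs(taskRows) {
  const seen = new Set();
  const dups = [];
  for (const row of taskRows) {
    const key = `${row.tenant_id}#${row.seq}`;
    if (seen.has(key)) dups.push(key);
    seen.add(key);
  }
  return dups;
}

// Invariant 2 (one reading): a given transition of a given task has one author.
function doubleAppliedTransitions(events) {
  const authors = new Map(); // "task:from->to" -> Set of actor ids
  for (const e of events) {
    const key = `${e.task_id}:${e.from}->${e.to}`;
    if (!authors.has(key)) authors.set(key, new Set());
    authors.get(key).add(e.actor_id);
  }
  return [...authors].filter(([, actors]) => actors.size > 1).map(([key]) => key);
}

const taskRows = [
  { tenant_id: "sandbox", seq: 1 },
  { tenant_id: "sandbox", seq: 2 },
  { tenant_id: "sandbox", seq: 2 }, // planted violation
];
const events = [
  { task_id: "T-1", from: "todo", to: "in_progress", actor_id: "a1" },
  { task_id: "T-1", from: "todo", to: "in_progress", actor_id: "a2" }, // planted violation
];
const violations = [...duplicateSeqs(taskRows), ...doubleAppliedTransitions(events)];
console.log(violations.length); // 2
// A gate would turn any non-empty list into a non-zero exit code.
```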

The concurrency gate: thirteen adversarial probes

The agent-experience harness cannot exercise the nastiest conditions: its personas use distinct keys in distinct processes, so shared-key siblings, overlapping ticks, a pinned LEDGENTER_RUN_ID, or a same-key run_start race never occur naturally. harness/concurrency-gate.mjs manufactures them deliberately — spawning out-of-process children with surgically controlled environments — and asserts ground truth in database rows and child stderr. Thirteen probes:

| The attack | What must hold |
|---|---|
| 16-wide same-key concurrent run_start | Exactly one run: one insert winner, fifteen reattach, no unique-violation leak |
| 16-wide distinct-key run_start racing one repo upsert | All sixteen resolve to a single repository row |
| Concurrent repo upserts plus a cross-host name collision | Local upserts converge to one row; same-named repos on different hosts stay distinct |
| Shared-key sibling forks writing identical content | N rows for N siblings; no phantom success |
| Identical sibling emissions, then a same-run replay | Both sibling rows land under their own run; the replay still dedupes |
| Same-run identical concurrent writes | Collapse is flagged (idempotent_replay or a visible retryable error); never a silent same-id double-ok |
| Two actors writing under one pinned run key | Never collapse; the actor fold keeps their keys distinct |
| Pinned LEDGENTER_RUN_ID, second fire after run_end | A fresh run is minted, with a stderr warning; the fires do not collapse |
| Pinned id plus a run group across two fires | Fresh per-tick keys, one series, tick 2's cursor seeded from tick 1, so a loop sees only since-last-tick |
| Overlapping sibling runs | Independent whoami cursors: one advancing never moves the other |
| A token-bearing remote URL and an absolute worktree path | The stored repo row is credential-free; the worktree path is basename-only |
| Real MCP server end-to-end, token in the environment | The transcript carries no provider token, and the first write's activity row already has a run id: lazy registration strictly precedes the write |
| Real MCP server end-to-end, repo with a local_path | The path is served to the agent (it needs it) but dropped from telemetry (the log doesn't) |

Note what the last three probes are: leak attempts. Verification here is not only "did the data stay consistent" but "did a credential or a filesystem path escape into storage or logs" — checked by regex against the actual rows and the actual JSONL transcript, not by code review.
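A regex-based leak check of this kind might be sketched as follows. The patterns are assumptions chosen for illustration (userinfo in a URL authority, plus two well-known token prefixes); the real gate's patterns and row shapes are not shown in this chapter.

```javascript
// Userinfo ("user:token@") inside the authority part of a URL-form remote.
const USERINFO = /^[a-z][a-z0-9+.-]*:\/\/[^\/?#]*@/i;
// Assumed token shapes; a real gate would list the providers it cares about.
const TOKEN = /(ghp_[A-Za-z0-9]{20,}|glpat-[A-Za-z0-9_-]{20,})/;

const hasCredentials = (remoteUrl) => USERINFO.test(remoteUrl);
const isBasenameOnly = (worktree) => !/[\\/]/.test(worktree);
const leakedLines = (lines) => lines.filter((line) => TOKEN.test(line));

// Checks run against what was actually stored and logged.
const fakeToken = "ghp_" + "a".repeat(36);
const storedRepo = { remote_url: "https://github.com/org/repo.git", worktree: "repo" };
const transcript = [
  '{"event":"run_start","run_id":"r1"}',
  `{"event":"env","value":"${fakeToken}"}`, // planted leak
];

console.log(hasCredentials(storedRepo.remote_url)); // false
console.log(hasCredentials(`https://x-access-token:${fakeToken}@github.com/org/repo.git`)); // true
console.log(isBasenameOnly(storedRepo.worktree));   // true
console.log(isBasenameOnly("/home/dev/work/repo")); // false
console.log(leakedLines(transcript).length);        // 1
```

Running the patterns over stored rows and transcript lines, rather than over source code, is what lets the check catch a leak introduced by any code path.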

Teeth: a gate that cannot fail is decoration

A green checkmark only means something if the check is falsifiable. Both gates therefore take --inject-violation N, which forces check N (an invariant in the integrity gate, a probe in the concurrency gate) to report red; the gate must then exit non-zero and render the run inadmissible. This is a unit-style proof that the gate can fail: if an injected violation ever passed, the gate itself would be the bug. Each gate also writes a machine-readable verdict (reports/<run_id>.integrity.json, .concurrency.json) and appends its section to the run's report; the exit code, not the prose, is the contract.
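The injection mechanism is small enough to sketch in full. The runner below is illustrative, not the harness's code: `runGate`, the probe shape, and the verdict fields are assumptions, and a real gate would additionally write the verdict file and set the process exit code.

```javascript
// Run every probe, optionally forcing probe N (1-based) to report red.
function runGate(probes, injectViolation = null) {
  const results = probes.map((probe, i) => {
    const n = i + 1;
    if (n === injectViolation) {
      return { n, name: probe.name, ok: false, injected: true };
    }
    return { n, name: probe.name, ok: Boolean(probe.check()) };
  });
  const admissible = results.every((r) => r.ok);
  return { admissible, exitCode: admissible ? 0 : 1, results };
}

const probes = [
  { name: "no duplicate seq", check: () => true },
  { name: "single transition author", check: () => true },
  { name: "replay produced zero dups", check: () => true },
];

const clean = runGate(probes);
const injected = runGate(probes, 3);
console.log(clean.exitCode, injected.exitCode); // 0 1
```

If `injected.exitCode` ever came back `0`, the defect would be in the runner, which is the point of the exercise.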

running the gates
node harness/integrity-gate.mjs --micro      # contention micro-scenario, then the 7 invariants
node harness/concurrency-gate.mjs           # the 13 probes against the live sandbox
node harness/integrity-gate.mjs --inject-violation 3
# ↑ must exit non-zero — proof the gate can fail

The CI ladder

The gates sit at the top of a ladder that runs from cheap static checks to live adversarial probes. Each rung catches what the rung below cannot:

every push     ──► typecheck → lint → tests → build        static, in order
               ──► RLS lint                                FORCE RLS + tenant policy, every public table
               ──► schema-drift gate                       generated types must match committed types
PRs touching
supabase / mcp ──► pgTAP isolation suite                   145 assertions, real Postgres
daily schedule ──► integrity gate · concurrency gate       7 invariants + 13 probes, LIVE backend
  • Typecheck, lint, test, build — the base build job, run in that order on every push.
  • RLS lint — asserts every public table carries forced row-level security and a tenant policy; service-only tables are explicitly allowlisted, never silently skipped (chapter 5).
  • Schema-drift gate — types regenerated from the live schema must match the committed ones; a migration without regenerated types fails the build.
  • pgTAP isolation suite — 145 assertions proving cross-tenant invisibility and raw-DML denial, gating every PR that touches the database or the MCP server.
  • Scheduled adversarial gates — the integrity and concurrency gates run daily against the live sandbox, so a regression in race behavior surfaces within a day even if no PR touched that code path.
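As an illustration of the RLS lint rung, the core of such a check could be as small as this. It is a sketch under stated assumptions: the table metadata would come from the database catalog (in Postgres, for instance, `pg_class.relforcerowsecurity` and the `pg_policies` view), and the table names, field names, and allowlist here are hypothetical.

```javascript
// Every table must force RLS and carry a tenant policy, unless explicitly allowlisted.
function rlsLint(tables, serviceOnlyAllowlist) {
  const failures = [];
  const allowlisted = [];
  for (const t of tables) {
    if (serviceOnlyAllowlist.includes(t.name)) {
      allowlisted.push(t.name); // reported, never silently skipped
      continue;
    }
    if (!t.forceRls) failures.push(`${t.name}: row-level security is not forced`);
    if (!t.hasTenantPolicy) failures.push(`${t.name}: no tenant policy`);
  }
  return { failures, allowlisted };
}

const tables = [
  { name: "tasks", forceRls: true, hasTenantPolicy: true },
  { name: "decisions", forceRls: false, hasTenantPolicy: true }, // planted failure
  { name: "job_queue", forceRls: false, hasTenantPolicy: false }, // service-only
];
const report = rlsLint(tables, ["job_queue"]);
console.log(report.failures);    // [ 'decisions: row-level security is not forced' ]
console.log(report.allowlisted); // [ 'job_queue' ]
```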

Why this much machinery? Because the failure modes it hunts are the quiet ones: a phantom success, a collapsed sibling write, a clobbered transition, a token in a log line. None of them throw. The only way to find them is to attack the real system and read the real rows — which is exactly what these gates do, every day.