Verification & integrity
An agent saying "done" is a claim, not a fact. This chapter covers the machinery that turns claims into proofs: completion gates in the status machine, idempotency as a safety property, and two adversarial gates that attack the live backend — and can prove they would fail if it lied.
How Ledgenter makes sure "done" actually means done.
A task can carry a checklist of what "finished" means, and Ledgenter won't let it be marked done until every item is ticked off. A task can also require proof — a link to the actual code that delivered it — or a teammate's review before it's allowed to close.
And every change to Ledgenter itself runs through an automated gauntlet of checks before it ships. The result is a system that holds the line on quality, rather than just taking an agent's word that the work is complete.
Verified done
In most trackers, done is a string anyone can write into a field. In Ledgenter
the transition into done is a gated event inside the
task_update RPC — checked in the same database transaction as the write, by the
same database that holds the truth. A task can carry three optional gates, set at
task_create time or patched later; each one moves a piece of "did the work
actually happen?" out of agent etiquette and into the status machine.
| Gate | What it is | What →done requires | The refusal you get |
|---|---|---|---|
| acceptance_criteria | A checklist of {text, met} items | Every item flipped to met: true | Names the first unmet criterion verbatim, so the agent knows exactly what is outstanding |
| requires_evidence | A boolean | At least one live code ref or attachment on the task | A hint pointing at task_code_ref (link the delivering commit/PR) or attach_add |
| reviewer_actor_id | A named reviewer | The review handoff answered by that reviewer | A hint pointing at handoff_respond, or at clearing the reviewer if review is genuinely no longer wanted |
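The three gates in the table can be pictured as one ordered check that runs before the status write. The following is a minimal in-memory sketch, not Ledgenter's implementation: the field names `code_refs`, `attachments`, and `review_answered` are hypothetical, and only the acceptance-criterion message is taken from this chapter. The other two messages and hints are illustrative.

```javascript
// Hypothetical model of the three optional gates. Gate names follow the
// table above; code_refs, attachments and review_answered are assumptions.
function checkDoneGates(task) {
  // Gate 1: every acceptance criterion must be met; name the first unmet one.
  const criteria = task.acceptance_criteria || [];
  const unmet = criteria.find((c) => !c.met);
  if (unmet) {
    return {
      ok: false,
      error: { message: `cannot complete: acceptance criterion not met: "${unmet.text}"` },
    };
  }
  // Gate 2: at least one code ref or attachment when evidence is required.
  if (task.requires_evidence) {
    const evidence = (task.code_refs || []).length + (task.attachments || []).length;
    if (evidence === 0) {
      return {
        ok: false,
        error: {
          message: "cannot complete: evidence required", // illustrative wording
          hint: "link the delivering commit/PR with task_code_ref, or use attach_add",
        },
      };
    }
  }
  // Gate 3: a named reviewer must have answered the review handoff.
  if (task.reviewer_actor_id && !task.review_answered) {
    return {
      ok: false,
      error: {
        message: "cannot complete: review pending", // illustrative wording
        hint: "reviewer answers via handoff_respond, or clear reviewer_actor_id",
      },
    };
  }
  return { ok: true };
}
```

A task that sets none of the three fields passes straight through, which matches the opt-in behavior described in this section.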
The reviewer gate has a matching on-ramp: when a task with a reviewer moves to
in_review, the database auto-creates the review handoff addressed to
that reviewer and notifies them — deduplicated on a per-task fingerprint, so bouncing in and
out of review never spams a second request. done then stays unreachable until
the reviewer answers.
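The deduplication described here can be sketched as an insert keyed by a per-task fingerprint. This is an assumption-laden illustration: the `store` shape, the `enterReview` name, and the `review:<task id>` fingerprint format are all hypothetical, and the real logic runs inside the database.

```javascript
// Hypothetical sketch: one review handoff per task, keyed by a fingerprint.
// store = { handoffs: Map, notifications: Array } is an in-memory stand-in.
function enterReview(store, task) {
  if (!task.reviewer_actor_id) return { created: false };
  const fingerprint = `review:${task.id}`; // assumed fingerprint format
  if (store.handoffs.has(fingerprint)) {
    // Re-entering in_review: the handoff already exists, so do not notify again.
    return { created: false };
  }
  store.handoffs.set(fingerprint, {
    to: task.reviewer_actor_id,
    task_id: task.id,
    answered: false,
  });
  store.notifications.push({ to: task.reviewer_actor_id, task_id: task.id });
  return { created: true };
}
```

Calling `enterReview` twice for the same task leaves exactly one handoff and one notification, which is the no-spam property the paragraph describes.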
Two unconditional checks back the optional gates. First, done-while-blocked is
rejected: if the task still has open blocking dependencies, the transition fails and
the error counts them. Second, expected_status gives you optimistic
concurrency: pass the status you believe the task is in, and if it moved underneath
you the write fails with a hinted CONFLICT ("stale write: expected … but task
is …") instead of silently clobbering another agent's transition. The precondition is
consumed, never stored — re-read, then retry.

    task_update(task_id, { "status": "done" })
    // → { ok:false, error:{ message:
    //     'cannot complete: acceptance criterion not met: "p95 under 200ms"' } }

    task_update(task_id, { "status": "done", "expected_status": "in_review" })
    // the task moved underneath you →
    // { ok:false, error:{ code:"CONFLICT",
    //     message:'stale write: expected status "in_review" but task is "done"' } }

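The "re-read, then retry" advice can be sketched as a small loop. The backend below is an in-memory stand-in that models only the `expected_status` precondition. `task_get`, `makeBackend`, `updateWithRetry`, and the `decide` callback are hypothetical names for illustration, not part of the documented API.

```javascript
// In-memory stand-in for the backend; only expected_status is modeled.
function makeBackend(initialStatus) {
  const task = { status: initialStatus };
  return {
    task_get: () => ({ status: task.status }), // hypothetical read call
    task_update: (patch) => {
      if (patch.expected_status !== undefined && patch.expected_status !== task.status) {
        return {
          ok: false,
          error: {
            code: "CONFLICT",
            message: `stale write: expected status "${patch.expected_status}" but task is "${task.status}"`,
          },
        };
      }
      task.status = patch.status; // the precondition is consumed, never stored
      return { ok: true, status: task.status };
    },
    _move: (s) => { task.status = s; }, // test hook: another agent's transition
  };
}

// Re-read the status, let the caller decide whether the transition still
// makes sense, then write with the freshly observed status as precondition.
function updateWithRetry(backend, nextStatus, decide, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const seen = backend.task_get().status;
    if (!decide(seen)) return { ok: false, skipped: true, status: seen };
    const res = backend.task_update({ status: nextStatus, expected_status: seen });
    if (res.ok || res.error.code !== "CONFLICT") return res;
  }
  return { ok: false, error: { code: "CONFLICT", message: "gave up after retries" } };
}
```

The `decide` step matters: a blind retry with the new status would defeat the point of the precondition, so the caller re-evaluates before writing again.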
The gates are opt-in per task. A task with none of the three behaves classically — but the floor never drops below the status machine itself: legal transitions only, no completing over open blockers, and reopening a finished task requires an explicit flag.
Idempotency as a safety property
Every write RPC opens by registering an idempotency key — inside the same
transaction as the write it protects, so there is no window where the work happened but the
key didn't stick. Replaying a used key returns the original result, flagged
idempotent_replay: true: no second row, no double side effects, and the caller
can tell a replay from a fresh write. Reusing a key with a different payload is
treated as a logic bug, not a retry — it fails with a 409-class CONFLICT
("idempotency key reuse with different payload"), because silently serving the old result
for new content would be a lie.
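The three outcomes (fresh write, flagged replay, conflicting reuse) can be sketched with an in-memory key registry. This is an illustration under stated assumptions: `makeIdempotentWriter` is a hypothetical name, the payload fingerprint is a plain `JSON.stringify` (which is sensitive to key order), and the single synchronous call stands in for the database transaction that registers the key and performs the write together.

```javascript
// Hypothetical in-memory model of idempotency-key handling.
function makeIdempotentWriter() {
  const keys = new Map(); // key -> { fingerprint, result }
  const rows = [];
  return {
    rows,
    write(key, payload) {
      const fingerprint = JSON.stringify(payload); // assumption: stable key order
      const prior = keys.get(key);
      if (prior) {
        if (prior.fingerprint !== fingerprint) {
          // Same key, different content: a logic bug, not a retry.
          return {
            ok: false,
            error: { code: "CONFLICT", message: "idempotency key reuse with different payload" },
          };
        }
        // Same key, same content: serve the original result, flagged as a replay.
        return { ...prior.result, idempotent_replay: true };
      }
      // Fresh key: register it and perform the write in one step.
      const row = { id: rows.length + 1, ...payload };
      rows.push(row);
      const result = { ok: true, id: row.id };
      keys.set(key, { fingerprint, result });
      return result;
    },
  };
}
```

Because the replay returns the stored result rather than re-running the write, the row count stays at one no matter how many times the caller retries.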
When the caller supplies no key, core derives one from the call's content — and deliberately folds in the calling actor and the ambient run key. This is the fix for a subtle phantom-success bug: two sibling runs forked from one parent, or two actors sharing one operator-pinned run id, can easily emit byte-identical writes. Without the fold they would collapse onto one row, and the loser would walk away holding the winner's id as a counterfeit success. With it, distinct siblings and distinct actors always get distinct keys; collapse only happens within one actor's one run, where it is genuinely a retry. When collapse is what you want — a cron tick ensuring "today's report task exists" — pass an explicit, date-stamped key.
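A derived key with the actor and run folded in might look like the sketch below. It is a guess at the shape, not the core's actual derivation: `deriveKey`, `canonical`, and `dailyReportKey` are hypothetical names, the canonical string is used directly as the key where a real implementation would more likely hash it, and the `daily-report:` prefix is invented for illustration.

```javascript
// Serialize a JSON value with sorted object keys, so that two payloads with
// the same content always produce the same string.
function canonical(value) {
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  if (value && typeof value === "object") {
    const body = Object.keys(value)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonical(value[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// Hypothetical derived key: call content plus the calling actor and run key.
function deriveKey({ actorId, runKey, rpc, payload }) {
  return canonical([actorId, runKey, rpc, payload]);
}

// Explicit, date-stamped key for "ensure today's report task exists".
// Uses the UTC calendar date; the prefix is an illustrative assumption.
function dailyReportKey(date) {
  return `daily-report:${date.toISOString().slice(0, 10)}`;
}
```

With this shape, byte-identical payloads from two sibling runs or two actors yield different keys, while a repeat from the same actor in the same run yields the same key and is treated as a retry.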
The integrity gate: seven invariants
Verification of single writes is necessary but not sufficient — the system also has to
hold up as a whole, under real concurrent agents. The first of two adversarial
harnesses, harness/integrity-gate.mjs, asserts seven invariants against the
live sandbox backend by reading ground truth: database rows, not model
prose. It exits non-zero on any violation, regardless of how well the agent
experience scored. The design rule is that a race must fail the run, not get
"fixed" by tightening an input schema.
| # | Invariant | How it is checked |
|---|---|---|
| 1 | No duplicate seq per tenant | Every live task's sequence number read back and asserted unique |
| 2 | No state transition applied by two distinct actors | At most one transition author per task across the run's telemetry |
| 3 | Idempotent replay produced zero dups | An active probe: replay a used key, count the rows |
| 4 | No contended handoff left wrongly open | The deliberately contended handoff ended answered, not stranded |
| 5 | Append-only spines unedited | Checksums over decisions and activity, taken twice — byte-identical or fail |
| 6 | No events lost across a whoami window | The since-last-seen delta count is self-consistent |
| 7 | Dependency cycles rejected | A deliberate cycle attempt returned TASK_DEPENDENCY_CYCLE, not a write |
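Invariants 1 and 2 reduce to simple scans over rows once the data has been read back. The sketch below assumes hypothetical row shapes (`tenant_id`, `seq` for tasks; `task_id`, `from`, `to`, `actor_id` for transition telemetry) and reads invariant 2 as "no single transition on a task has more than one author". The real harness may define both differently.

```javascript
// Invariant 1 (sketch): return every tenant/seq pair that appears more than once.
function duplicateSeqs(tasks) {
  const seen = new Set();
  const dups = new Set();
  for (const t of tasks) {
    const key = `${t.tenant_id}:${t.seq}`;
    if (seen.has(key)) dups.add(key);
    else seen.add(key);
  }
  return [...dups];
}

// Invariant 2 (sketch): return every task transition recorded under more
// than one distinct actor.
function multiAuthorTransitions(events) {
  const authors = new Map(); // "task:from->to" -> Set of actor ids
  for (const e of events) {
    const key = `${e.task_id}:${e.from}->${e.to}`;
    if (!authors.has(key)) authors.set(key, new Set());
    authors.get(key).add(e.actor_id);
  }
  return [...authors]
    .filter(([, actorSet]) => actorSet.size > 1)
    .map(([key]) => key);
}
```

Both functions return the offending keys rather than a boolean, so a failing run can report exactly which rows broke the invariant.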
The gate's credential is itself a statement of posture: it runs through
@ledgenter/core with an ordinary sandbox actor key whose JWT is RLS-scoped to the
sandbox tenant — which is exactly the audit scope. No service-role key, no raw Postgres
connection. The --micro flag first runs a standalone contention scenario (two
identical transitions, an idempotency replay, a cycle attempt, a contended handoff) so the
gate can be validated cheaply, without a full multi-agent re-run.
The concurrency gate: thirteen adversarial probes
The agent-experience harness cannot exercise the nastiest conditions: its personas use
distinct keys in distinct processes, so shared-key siblings, overlapping ticks, a pinned
LEDGENTER_RUN_ID, or a same-key run_start race never occur naturally.
harness/concurrency-gate.mjs manufactures them deliberately — spawning
out-of-process children with surgically controlled environments — and asserts ground truth
in database rows and child stderr. Thirteen probes:
| The attack | What must hold |
|---|---|
| 16-wide same-key concurrent run_start | Exactly one run: one insert winner, fifteen reattach, no unique-violation leak |
| 16-wide distinct-key run_start racing one repo upsert | All sixteen resolve to a single repository row |
| Concurrent repo upserts plus a cross-host name collision | Local upserts converge to one row; same-named repos on different hosts stay distinct |
| Shared-key sibling forks writing identical content | N rows for N siblings — no phantom success |
| Identical sibling emissions, then a same-run replay | Both sibling rows land under their own run; the replay still dedupes |
| Same-run identical concurrent writes | Collapse is flagged (idempotent_replay or a visible retryable error) — never a silent same-id double-ok |
| Two actors writing under one pinned run key | Never collapse — the actor fold keeps their keys distinct |
| Pinned LEDGENTER_RUN_ID, second fire after run_end | A fresh run is minted, with a stderr warning; the fires do not collapse |
| Pinned id plus a run group across two fires | Fresh per-tick keys, one series, tick 2's cursor seeded from tick 1 — a loop sees only since-last-tick |
| Overlapping sibling runs | Independent whoami cursors: one advancing never moves the other |
| A token-bearing remote URL and an absolute worktree path | The stored repo row is credential-free; the worktree path is basename-only |
| Real MCP server end-to-end, token in the environment | The transcript carries no provider token, and the first write's activity row already has a run id — lazy registration strictly precedes the write |
| Real MCP server end-to-end, repo with a local_path | The path is served to the agent (it needs it) but dropped from telemetry (the log doesn't) |
Note what the last three probes are: leak attempts. Verification here is not only "did the data stay consistent" but "did a credential or a filesystem path escape into storage or logs" — checked by regex against the actual rows and the actual JSONL transcript, not by code review.
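A regex-based leak check of this kind might be sketched as follows. The field name `worktree_path`, the function names, and the single URL pattern are assumptions for illustration. The pattern flags any URL with a userinfo part, so it would also flag a harmless `ssh://git@host/...` remote, and a real gate presumably uses a more specific set of patterns.

```javascript
// Matches "scheme://something@", i.e. a URL carrying a userinfo part.
const URL_USERINFO = /:\/\/[^\/\s@]+@/;

// Scan one stored row (a flat object) for leaked credentials or paths.
function findLeaks(record) {
  const leaks = [];
  for (const [field, value] of Object.entries(record)) {
    if (typeof value !== "string") continue;
    if (URL_USERINFO.test(value)) leaks.push(`${field}: credentials in URL`);
    // Hypothetical field name: a worktree path must be a bare basename.
    if (field === "worktree_path" && /[\/\\]/.test(value)) {
      leaks.push(`${field}: not basename-only`);
    }
  }
  return leaks;
}

// Scan a JSONL transcript for a known token; returns 1-based line numbers.
function transcriptLinesWithToken(jsonl, token) {
  const hits = [];
  jsonl.split("\n").forEach((line, i) => {
    if (line.includes(token)) hits.push(i + 1);
  });
  return hits;
}
```

The transcript scan works on the raw text rather than parsed JSON, so a token is caught wherever it appears, including inside nested or escaped strings.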
Teeth: a gate that cannot fail is decoration
A green checkmark only means something if the check is falsifiable. Both gates therefore
take --inject-violation N, which forces probe N to report red — and
the gate must then exit non-zero and render the run inadmissible. This is a unit-style proof
that the gate fails red: if an injected violation ever passed, the gate itself
would be the bug. Each gate also writes a machine-readable verdict
(reports/<run_id>.integrity.json,
.concurrency.json) and appends its section to the run's report; the exit code,
not the prose, is the contract.

    node harness/integrity-gate.mjs --micro   # contention micro-scenario, then the 7 invariants
    node harness/concurrency-gate.mjs         # the 13 probes against the live sandbox
    node harness/integrity-gate.mjs --inject-violation 3
    # ↑ must exit non-zero: proof the gate can fail

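The falsifiability mechanism can be sketched as a probe runner in which an injected violation overrides whatever the probe would have reported. `runGate` and the probe shape (`name`, `check`) are hypothetical. The actual harness also writes a verdict file and a report section, which this sketch leaves out.

```javascript
// Run probes 1..N. If injectViolation names a probe, force it red.
function runGate(probes, injectViolation) {
  const results = probes.map((probe, i) => {
    const n = i + 1;
    if (n === injectViolation) {
      // Forced failure: the probe body is skipped and reported red.
      return { n, name: probe.name, ok: false, injected: true };
    }
    return { n, name: probe.name, ok: Boolean(probe.check()) };
  });
  const failed = results.filter((r) => !r.ok);
  return {
    admissible: failed.length === 0,
    exitCode: failed.length === 0 ? 0 : 1, // the exit code is the contract
    results,
  };
}

// In a CLI wrapper one would typically end with:
//   process.exitCode = runGate(probes, injected).exitCode;
```

If a run with an injected violation ever came back with `exitCode` 0, the fault would lie in the runner itself, which is the property the flag exists to prove.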
The CI ladder
The gates sit at the top of a ladder that runs from cheap static checks to live adversarial probes. Each rung catches what the rung below cannot:

    every push ──► typecheck → lint → tests → build        static, in order
               ──► RLS lint                                FORCE RLS + tenant policy, every public table
               ──► schema-drift gate                       generated types must match committed types
    PRs touching supabase / mcp ──► pgTAP isolation suite  145 assertions, real Postgres
    daily schedule ──► integrity gate · concurrency gate   7 invariants + 13 probes, LIVE backend

- Typecheck, lint, test, build — the base build job, run in that order on every push.
- RLS lint — asserts every public table carries forced row-level security and a tenant policy; service-only tables are explicitly allowlisted, never silently skipped (chapter 5).
- Schema-drift gate — types regenerated from the live schema must match the committed ones; a migration without regenerated types fails the build.
- pgTAP isolation suite — 145 assertions proving cross-tenant invisibility and raw-DML denial, gating every PR that touches the database or the MCP server.
- Scheduled adversarial gates — the integrity and concurrency gates run daily against the live sandbox, so a regression in race behavior surfaces within a day even if no PR touched that code path.
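As one example of a rung, the RLS lint can be pictured as a pure check over table metadata. The input shape below (`name`, `force_rls`, `policies` with a `tenant_scoped` flag) is a hypothetical simplification. A real lint would presumably read this from the Postgres catalogs, and its notion of a "tenant policy" may be stricter.

```javascript
// Sketch of an RLS lint: every non-allowlisted table must have forced
// row-level security and at least one tenant-scoped policy.
function rlsLint(tables, serviceOnlyAllowlist) {
  const allow = new Set(serviceOnlyAllowlist);
  const violations = [];
  for (const t of tables) {
    if (allow.has(t.name)) continue; // explicitly allowlisted, never silently skipped
    if (!t.force_rls) violations.push(`${t.name}: forced RLS not set`);
    if (!(t.policies || []).some((p) => p.tenant_scoped)) {
      violations.push(`${t.name}: no tenant policy`);
    }
  }
  return violations;
}
```

An empty result means the rung passes, and any entry fails the build with the table named.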
Why this much machinery? Because the failure modes it hunts are the quiet ones: a phantom success, a collapsed sibling write, a clobbered transition, a token in a log line. None of them throw. The only way to find them is to attack the real system and read the real rows — which is exactly what these gates do, every day.