Chapter 11

Operations

Running the office: how to credential an agent, put it on a schedule, read its exit codes, and trust the janitorial jobs that keep the building clean while no one is watching.

In plain language

For the person setting Ledgenter up and keeping it running.

This chapter is for an operator — whoever sets up agents, gives each one its own key, and optionally wires up scheduled background agents. The short version: each agent gets its own identity and key, keys are stored safely and shown only once, and a dead or revoked key fails loudly instead of failing in silence.

The workspace also tidies up after itself — routine housekeeping runs automatically inside the system, with no babysitting required.

Provisioning an agent

Ledgenter does not run agents. Task Scheduler, cron, or a Claude Code session is the runtime; Ledgenter supplies the work signal, the claim semantics, and the contract text. The operator's first job is therefore identity: every unattended agent gets its own actor and its own API key. A shared key collapses attribution — whoami, the inbox, assignments, and claims all resolve from the actor behind the key, so two agents on one key become indistinguishable in every record they leave.

Provisioning is three steps. Register the actor (any session with a key for the tenant can do this — the register_actor tool or the CLI), mint its key with the service-role script, and write a per-agent MCP config:

one-time setup (powershell)
# 1. Register the actor. external_ref is the stable machine name —
#    pick it once and keep it; keys, prompts, and configs all key off it.
ledgenter register-actor --external-ref builder-1 --kind agent --display-name "Builder 1"

# 2. Mint this actor's API key in the console (Settings -> API keys).
#    The raw key (ledgenter_live_…) is shown EXACTLY ONCE — only its sha256 is stored. Copy it now.
mkdir "$env:USERPROFILE\.ledgenter\keys" -Force | Out-Null
Set-Content "$env:USERPROFILE\.ledgenter\keys\builder-1.key" "<ledgenter_live_...>" -NoNewline

The per-agent MCP config lives at %USERPROFILE%\.ledgenter\mcp\builder-1.mcp.json and is used only by the scheduled task. It sets LEDGENTER_RUN_GROUP, so opening it in an interactive session would make that session a loop_tick:

builder-1.mcp.json
{
  "mcpServers": {
    "ledgenter": {
      "command": "node",
      "args": ["C:/dev/ledgenter/packages/mcp-server/dist/server.js"],
      "env": {
        "LEDGENTER_API_KEY": "<builder-1 key>",
        "LEDGENTER_API_BASE": "https://<project>.supabase.co",
        "LEDGENTER_RUN_GROUP": "builder-1-loop"
      }
    }
  }
}

LEDGENTER_RUN_GROUP is static per agent. The MCP server mints a fresh per-tick run key automatically; ticks become siblings under one run series, each seeding its "what changed since last tick" cursor from the previous tick (chapter 6).

[§]

Key hygiene. A per-actor key is minted server-side and stored only as a sha256 hash plus a short display prefix — the raw key exists once, on your screen, at mint time, and is never recoverable. A key never ships in a client or agent bundle.
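To make the storage rule concrete, here is a minimal shell sketch of what "hash plus display prefix" could look like. The function names, the 20-character prefix length, and the sample key are assumptions for illustration; the real mint path runs server-side and may differ.

```shell
#!/bin/sh
# Hypothetical sketch: derive what a server could persist for a key.
# Only the hash and a short prefix are kept; the raw key is never stored.

# sha256 hex digest of the raw key (no trailing newline is hashed).
hash_key() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -d' ' -f1
  else
    printf '%s' "$1" | shasum -a 256 | cut -d' ' -f1
  fi
}

# Short, non-secret prefix for display in listings (length is an assumption).
display_prefix() {
  printf '%s' "$1" | cut -c1-20
}

raw_key='ledgenter_live_EXAMPLEONLY'   # made-up value, not a real key
echo "stored hash:   $(hash_key "$raw_key")"
echo "stored prefix: $(display_prefix "$raw_key")"
```

Because the digest is one-way, a lost key cannot be recovered from what is stored; the only remedy is to mint a new one.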

The loop-agent launcher

scripts/agents/loop-agent.ps1 is the tick wrapper: one invocation is one tick. It takes three mandatory parameters — -AgentName (resolves the key file and MCP config), -Cwd (the repo checkout), -PromptFile (the tick prompt) — and optional -Project (scopes the poll), -ApiBase, -MaxTurns (default 50), -Model (default sonnet), and -LedgenterCli.

Task Scheduler (every 15 min) ──► loop-agent.ps1
   ├─ preflight: key file · mcp.json · cwd · api base   missing → exit 78, nothing spawns
   ├─ Set-Location <repo>                              CLAUDE.md + .claude/skills load from HERE
   ├─ ledgenter skills sync --soft-fail                    best-effort, never blocks the tick
   ├─ ledgenter agent poll --quiet --soft-fail            exit 3 = idle → stop · exit 0 = work
   └─ claude -p <tick prompt> --mcp-config …          one fresh loop_tick run per spawn

Each step earns its place:

  • Set-Location runs before the spawn because the working directory is fixed before the model exists. It is what makes Claude Code natively load that repo's CLAUDE.md and .claude/skills/; an MCP server cannot inject either (chapter 7).
  • ledgenter skills sync refreshes the materialized skills before the tick. MCP reads are never stale; this only affects native skill invocation, so it runs best-effort and never blocks (chapter 8).
  • ledgenter agent poll is the ~$0 preflight: side-effect-free counts of inbox, ready, and pool work under the agent's own key. No runs row is created, no cursor moves. Exit 3 means idle — the wrapper exits 0 and nothing spawns, which is why a 15-minute cadence costs nearly nothing on a quiet office. Cadence is about latency, not cost.
  • The spawn runs Claude Code headless against the per-agent MCP config (--strict-mcp-config, --permission-mode acceptEdits). The agent claims its own work with task_claim; the lease is what makes a crashed tick recoverable.
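For readers on the cron side, the preflight step from the diagram above can be sketched in POSIX shell. This is not the contents of loop-agent.ps1: the function name and argument order are assumptions, and only the exit code 78 comes from this chapter.

```shell
#!/bin/sh
# Hypothetical preflight, mirroring the launcher's first step:
# any missing input returns 78 so that nothing is spawned.
preflight() {
  key_file=$1; mcp_config=$2; cwd=$3; api_base=$4
  [ -s "$key_file" ]   || { echo "preflight: key file missing or empty: $key_file" >&2; return 78; }
  [ -f "$mcp_config" ] || { echo "preflight: mcp config missing: $mcp_config" >&2; return 78; }
  [ -d "$cwd" ]        || { echo "preflight: checkout missing: $cwd" >&2; return 78; }
  [ -n "$api_base" ]   || { echo "preflight: api base not set" >&2; return 78; }
  return 0
}
```

A caller would run it first and stop on a non-zero result, for example `preflight "$key" "$cfg" "$repo" "$base" || exit $?`, so a stale path fails the schedule visibly instead of spawning a model that cannot work.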

Scheduling on Windows

task scheduler registration
Register-ScheduledTask -TaskName "ledgenter-builder-1" `
  -Action (New-ScheduledTaskAction -Execute "powershell.exe" `
    -Argument "-NoProfile -File C:\\scripts\agents\loop-agent.ps1 -AgentName builder-1 -Cwd C:\ -PromptFile C:\\builder-tick.md") `
  -Trigger (New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 15)) `
  -Settings (New-ScheduledTaskSettingsSet -MultipleInstances IgnoreNew -ExecutionTimeLimit (New-TimeSpan -Minutes 30))

Two settings matter. IgnoreNew prevents overlapping ticks; overlap is actually safe (per-tick run keys, actor-and-run-folded idempotency, SKIP LOCKED claims), merely wasteful. And keep ExecutionTimeLimit < claim lease: the defaults (30-minute limit, 1-hour lease) satisfy it, so the scheduler always kills a runaway tick before the lease on its claim can lapse underneath it. A killed tick's claim then expires on its own within the lease, which at a 15-minute cadence is at most four scheduled ticks. Run one scheduled task per (agent × repo checkout); the task's -Cwd is the repo.
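The limit-versus-lease ordering is easy to check mechanically before registering a task. The sketch below takes all values in whole minutes; both function names are hypothetical, and the numbers are the defaults quoted above.

```shell
#!/bin/sh
# Hypothetical check: the scheduler must kill a tick before its claim lease lapses.
limit_below_lease() {
  limit_min=$1; lease_min=$2
  [ "$limit_min" -lt "$lease_min" ]
}

# Upper bound on how many scheduled ticks can pass while a dead
# tick's claim is still held (ceiling of lease / cadence).
max_ticks_shadowed() {
  lease_min=$1; cadence_min=$2
  echo $(( (lease_min + cadence_min - 1) / cadence_min ))
}

if limit_below_lease 30 60; then
  echo "ok: 30-minute limit is inside the 60-minute lease"
fi
echo "a dead claim outlasts at most $(max_ticks_shadowed 60 15) ticks"
```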

The cron variant

crontab (linux/macos)
*/15 * * * * cd /home/m/repo && LEDGENTER_API_KEY=$(cat ~/.ledgenter/keys/builder-1.key) \
  ledgenter agent poll --quiet --soft-fail && \
  claude -p "$(cat ~/agents/builder-tick.md)" \
    --mcp-config ~/.ledgenter/mcp/builder-1.mcp.json --strict-mcp-config \
    --permission-mode acceptEdits --max-turns 50

The && chain exploits poll's exit-3 short-circuit: idle ticks never reach the spawn.
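One limitation of the bare chain is that it treats idle (exit 3) and a real failure (say, exit 77) the same way: both simply stop it. A small wrapper can keep them apart. The sketch below is one way to write that, not a shipped script; tick_action is a hypothetical name, and only the exit codes come from this chapter.

```shell
#!/bin/sh
# Hypothetical helper: turn the poll's exit code into a decision.
tick_action() {
  case "$1" in
    0) echo spawn ;;   # work is available: run the headless tick
    3) echo idle ;;    # nothing to do: stop quietly
    *) echo fail ;;    # e.g. 77 (auth) or 78 (config): stay loud
  esac
}

# Usage sketch inside a cron script (same commands as the crontab entry):
#   ledgenter agent poll --quiet --soft-fail; rc=$?
#   case "$(tick_action "$rc")" in
#     spawn) run the claude command from the crontab entry ;;
#     idle)  exit 0 ;;
#     fail)  exit "$rc" ;;
#   esac
```

Exiting 0 on idle and passing other codes through means any monitoring that watches the job's exit status sees only genuine failures.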

The tick prompt

The prompt file frames one bounded shift. Budget discipline is deliberate: at most one task per coding tick — small ticks fail small.

builder-tick.md
You are builder-1, an unattended Ledgenter loop agent. This is one tick:
1. whoami.
2. Drain your inbox (handoff_claim -> handoff_respond -> inbox ack).
3. task_claim (project <X> / label <Y>) and complete AT MOST ONE task end-to-end —
   verification gates apply; task_code_ref the delivering commit.
   If you cannot finish: comment your progress, then task_release with a note.
4. decision_log / knowledge_write anything durable.
5. run_end with counts.
Follow guide('sessions-and-loops'). Never wait for permission; handoff_create to a human instead.

[i]

The /loop honesty caveat. Running the tick prompt under Claude Code's /loop on an always-on machine works, but every tick shares one MCP server process — one run. since_last_seen still advances per whoami call, but run-tree granularity is coarse, and identical-content writes across ticks can idempotency-collapse (keys fold in the run). Pass explicit idempotency_keys for deliberately-repeated writes — or prefer the scheduled headless tick, which gets a fresh run per tick for free.

The failure model

The launcher is deliberately boring because the recovery story lives elsewhere — in leases, the reaper, and exit-code discipline. Crashes are an expected operating condition, not an incident:

Failure | Recovery
Tick killed mid-task | The claim lease expires (default 1 hour) and the task returns to the pool for the next task_claim.
Tick killed silently (no run_end) | The abandoned-run reaper ends the run and releases its claims: assignee and lease cleared, in_progress drops back to todo.
Poll fails transiently | --soft-fail swallows it to exit 0, so the scheduler records no failure; the next tick retries.
Revoked or expired key | The poll fails loudly (exit 77) even under --soft-fail; auth is terminal, not transient. Fix the key.
Stale checkout path | The wrapper's preflight exits 78 before anything spawns.

Environment reference

Every knob the runtime reads, in one place. "Core" variables are honored by every consumer (MCP server, CLI, anything built on @ledgenter/core); the rest are consumer-specific.

Variable | Scope | Meaning
LEDGENTER_API_KEY | Core | The per-actor API key (ledgenter_live_…), exchanged for a short-lived JWT. Required.
LEDGENTER_API_BASE | Core | The Supabase project URL. Required.
LEDGENTER_API_VERSION | Core | Date-based API version pin (default 2026-06-08).
LEDGENTER_RUN_ID | Core | This process's run key, set by a spawner so the child's writes attribute to a known run. Never pin it alongside LEDGENTER_RUN_GROUP: a fresh key is minted with a warning, because a pinned id would collapse every tick onto one run.
LEDGENTER_PARENT_RUN_ID | Core | The parent run key; marks this process a subagent in the run tree.
LEDGENTER_RUN_GROUP | Core | The recurring-series key. When set, the process is a loop_tick: a fresh run key is minted under the series, every tick.
LEDGENTER_REPO_URL | Core | Explicit repo remote for hosted / no-cwd contexts; overrides the git-detected remote.
LEDGENTER_BRANCH | Core | Explicit branch; overrides git detection.
LEDGENTER_HEAD_SHA | Core | Explicit HEAD commit; overrides git detection.
LEDGENTER_REPO_AUTODETECT | Core | Set 0/false to disable git probing of the working directory entirely.
LEDGENTER_REPO_MAP | Core | Set 0/false to disable the machine-local repo→checkout map (the local_path overlay and its auto-learn).
LEDGENTER_REPO_MAP_PATH | Core | Overrides the map file location (default ~/.ledgenter/repos.json).
LEDGENTER_ENV_FILE | CLI | An optional key=value env file loaded at startup (default /etc/cronos/env). process.env always wins; the file only fills gaps.
LEDGENTER_CLI_LOG | CLI | Where soft-failed errors are appended as JSON lines for forensics (default ~/.ledgenter/cli.log).
LEDGENTER_SANDBOX | MCP server | Set 1/true to append a verbose, secret-scrubbed JSONL transcript of every tool call (the sandbox-review feed). Off by default; gate-probed off.
LEDGENTER_SANDBOX_DIR | MCP server | Base directory for transcripts (default supabase/sandbox/runs/ under the cwd); the file is <run_id>.jsonl.
LEDGENTER_SANDBOX_FILE | MCP server | Explicit transcript file path; overrides the directory rule.

Note what is not here: no database URL, no service-role key, no tenant id. The agent machine holds exactly one secret — its own API key (chapter 5).

CLI exit codes

The CLI maps the error envelope's code onto stable, sysexits-style process exit codes, so cron scripts can branch without parsing JSON:

Exit | Meaning
0 | Success, or a transient error swallowed by --soft-fail.
3 | ledgenter agent poll only: idle, no work available. Deliberately outside the sysexits range so chains can distinguish "nothing to do" from success and failure.
65 | Validation: the arguments are malformed. The caller's bug.
69 | Not found.
75 | Transient: network, timeout, upstream, rate-limited, in-progress. Safe to retry.
77 | Auth or permission denied.
78 | Configuration: missing or invalid LEDGENTER_API_KEY / LEDGENTER_API_BASE. (The loop launcher reuses 78 for its own preflight failures.)
1 | Anything unclassified.

--soft-fail gates on error.retryable — the envelope's own transience signal — not on a list of error codes. A retryable failure (a network blip, an exchange 5xx, database contention) is swallowed to exit 0 and logged to ~/.ledgenter/cli.log; the next tick replays idempotently. A terminal failure stays loud even under soft-fail.
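The gate can be pictured as a function of two inputs: the exit code the error would normally map to, and the envelope's retryable flag. The sketch below is illustrative only; the function name and the string form of the flag are assumptions, not the CLI's internals.

```shell
#!/bin/sh
# Hypothetical model of --soft-fail: swallow only retryable failures.
#   $1 = exit code the error maps to   $2 = error.retryable ("true"/"false")
soft_fail_exit() {
  code=$1; retryable=$2
  if [ "$code" -ne 0 ] && [ "$retryable" = "true" ]; then
    echo 0          # transient: report success, the next tick retries
  else
    echo "$code"    # success, or a terminal failure that must stay loud
  fi
}

soft_fail_exit 75 true    # prints 0
soft_fail_exit 77 false   # prints 77
```

Keying on the flag rather than on a list of codes means a new error code needs no change to the gate, as long as its envelope states whether it is retryable.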

[!]

Revoked keys fail loud. A revoked or invalid credential is terminal, not retryable — so an unattended agent with a dead key fails its schedule visibly (exit 77) rather than exiting 0 forever. Rotate a key and the next tick surfaces it immediately.

Server-side maintenance

The building has janitors: pg_cron jobs inside the database that need no agent, no machine, and no operator attention. They are why a crashed agent leaves no permanent mess.

Job | Schedule | What it does
cadre-run-reaper | Hourly | Marks active runs with no activity for 6 hours abandoned, then releases every task those runs were holding (assignee and lease cleared, in_progress back to todo) so the pool recovers without intervention.
cadre-wr-sweep | Every 5 min | Expires open or claimed handoffs whose due_at has passed and notifies the sender: re-issue, broaden the recipients, or escalate.
cadre-embedding-backfill | Every 2 min | Drains the pending-embedding queue for knowledge and decision notes; this is why an embedding outage can delay semantic search but never block a write.
cadre-activity-retention | Daily | Batched purge of aged activity rows.
cadre-idempotency-sweep | Daily | Purges expired idempotency keys.
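The reaper's selection rule reduces to one comparison. Here is a minimal restatement in shell, assuming timestamps as Unix epoch seconds and treating exactly 6 hours as already abandoned. The real job is SQL run by pg_cron, and the function name here is hypothetical.

```shell
#!/bin/sh
# Hypothetical restatement of the reaper's rule: a run is abandoned
# once it has gone 6 hours without activity.
ABANDON_AFTER_SECONDS=$((6 * 60 * 60))

is_abandoned() {
  last_activity=$1; now=$2
  [ $((now - last_activity)) -ge "$ABANDON_AFTER_SECONDS" ]
}

now=$(date +%s)
if is_abandoned $((now - 7 * 3600)) "$now"; then
  echo "run idle for 7h: would be reaped"
fi
```

Since the job runs hourly, a silent run is picked up somewhere between 6 and 7 hours after its last activity.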

One more maintenance primitive exists for the test estate: reset_sandbox() wipes a tenant's rows in foreign-key-safe order — and refuses, with an error, to run against any tenant not explicitly flagged as a sandbox. Its maintenance contract is that every new tenant-scoped table must join its delete list, or rows leak across sandbox resets.

Day 2: staying honest

Operations is mostly watching gates that watch the system:

  • The adversarial gates run on a schedule. A daily GitHub Actions workflow (plus on-demand dispatch) runs the integrity gate's micro contention scenario and the full concurrency-probe suite against the live sandbox tenant — registration races, shared-key siblings, pinned-run-id misuse, loop-tick cursors, token and path privacy (chapter 9). These need live keys and mutate shared sandbox state, so they cannot run per-PR; the scheduled job is the net that catches a regression that merged green. Runs are serialized — two interleaved gates would contaminate each other's ground truth.
  • Reproduce locally with node harness/integrity-gate.mjs --micro and node harness/concurrency-gate.mjs, and run the 145-assertion pgTAP isolation suite with pnpm db:test before touching anything under supabase/.
  • Key rotation is cheap by design. The blast radius of a key is one actor; the JWT it buys lives ~15 minutes. Revocation acts at the exchange — within one JWT lifetime the agent's next exchange fails, and the exit-code discipline above makes that failure loud on its schedule rather than silent in a log.

That is the whole operator surface: per-agent keys, a tick wrapper, a failure model built on leases and reapers, and scheduled adversaries. Nothing here requires a pager — the system is designed so that the worst routine failure is a task quietly returning to the pool.