Spec: Session Manager

Purpose

This spec defines the session manager: the daemon subsystem that owns the lifecycle of sessions — from creation through active work to checkpoint, detach, reattach, and termination. It is the binding layer between the operator's intent (via the HTTP/WS API) and the daemon's execution machinery (primary agent, subagent supervisor, plugin host).

This document is normative for:

  • The session state machine — every state, every transition, every guard.
  • The PTY broker — how terminal sessions are multiplexed, attached, detached, resized, and persisted.
  • The checkpoint lifecycle — creation, inspection, resume, rollback, and the state they capture.
  • The reattach semantics — what happens when the operator reconnects after a disconnect.
  • The run model — what a "run" is, how runs relate to sessions, cancellation, and failure.
  • The persistence contract — what is durably stored, when, and in what shape.
  • The concurrency model — how many sessions, subagents, and runs can be active simultaneously.

It is not normative for:

  • The HTTP/WS API shape (that's http-api.md — the session manager implements the behavior behind those endpoints).
  • The sandbox or cage mechanism (that's sandbox.md).
  • The primary agent's dispatch logic (the session manager starts runs; the primary decides what to do within a run).
  • The plugin host (that's plugin-host.md).
  • The project DSL format (that's project-dsl.md).
  • The daemon's process-level lifecycle (that's daemon.md).

Constraints (from ADRs)

Constraint | Source
Web-first; sessions survive operator disconnects | ADR-0002
Runtime is Bun; PTY via node-pty or Bun's native PTY (if available) | ADR-0004
Session state persisted to SQLite (or Postgres); durable across daemon restarts | ADR-0005
Auth identity (user_id) threads through sessions for audit and multi-operator | ADR-0007
Subagents spawned in cages; session manager delegates to sandbox subsystem | ADR-0009
Both deployment modes use the same session semantics | ADR-0010

Core concepts

Session

A session is a persistent, named context within a project. It is the unit of "what is currently happening" — an ongoing conversation between the operator and the primary agent, potentially with subagents running work in cages.

Key properties:

  • Project-scoped. Every session belongs to exactly one project.
  • Operator-scoped. Every session has a created_by user_id. In per-user mode, this is always the single operator. In system-wide mode, sessions are visible only to the operator who created them (v0; shared sessions are a v2 concern).
  • Survives disconnect. The operator closing their browser or losing network does not end the session. Work continues. The operator reattaches later.
  • Survives daemon restart. Session state is persisted continuously. A daemon SIGKILL loses only uncommitted in-flight reasoning (marked as a failed run on next startup).
  • Named with ULIDs. Session IDs are ULIDs (sortable, unique, no coordination needed). Displayed to the operator as the ULID string (e.g., 01HXAB...).

Run

A run is a single unit of model work within a session: the operator sends a message, the primary processes it (possibly dispatching subagents), and the work completes (or fails, or is cancelled). A session contains an ordered list of runs.

Key properties:

  • One active run per session at a time (v0). If a message arrives while a run is in progress, the API returns 409 conflict. This conflict is per-session and is separate from the concurrency throttle, which operates across sessions: if the operator's or project's running-session count is at the limit, the message is accepted but the session enters queued with the run in pending.
  • Runs have their own state (pending, running, done, failed, cancelled). A run in pending means the message is saved but the primary has not started, either because the session just queued it (concurrency throttle) or because the daemon is about to pick it up (the normal brief gap between creation and dispatch).
  • Runs are the audit boundary. Every run records its input, output, subagent invocations, tool calls, and timing.
  • Runs can be cancelled. Cancellation kills in-flight subagents and returns control to the operator.
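
The admission rule for an incoming message can be sketched as a small pure function. This is a minimal sketch, not the daemon's implementation: the names admitMessage, runningSessions, and limit are hypothetical, and the "unspecified" result covers session states that the text above does not address.

```typescript
type SessionState = "idle" | "running" | "queued" | "paused" | "ended" | "failed";

type Admission =
  | { kind: "conflict"; status: 409 }               // a run is already active in this session
  | { kind: "start" }                               // idle -> running
  | { kind: "queue"; reason: "concurrency_limit" }  // idle -> queued, run stays pending
  | { kind: "unspecified" };                        // state not covered by the text above

// runningSessions and limit are hypothetical inputs standing in for the
// per-operator / per-project concurrency throttle.
function admitMessage(state: SessionState, runningSessions: number, limit: number): Admission {
  // Per-session conflict: checked before the cross-session throttle.
  if (state === "running") return { kind: "conflict", status: 409 };
  if (state !== "idle") return { kind: "unspecified" };
  // Cross-session throttle: the message is accepted, but the run waits in pending.
  if (runningSessions >= limit) return { kind: "queue", reason: "concurrency_limit" };
  return { kind: "start" };
}
```

Checking the per-session conflict first keeps the two mechanisms distinct: a busy session never consumes or waits for a concurrency slot.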

Checkpoint

A checkpoint is a first-class pause in a session. It captures a snapshot of the session's state at a moment in time, allowing the operator to inspect, edit prompts, and either resume or roll back.

Key properties:

  • Operator-initiated (the operator hits ⏸ in the UI) or model-initiated (the primary or a subagent requests one via an explicit tool call).
  • Inspectable. The checkpoint contains the message history up to that point, the active subagent states, the current tool calls, and the prompts in effect.
  • Editable. At a checkpoint, the operator can edit system prompts before resuming.
  • Reversible. The operator can roll back to a checkpoint, discarding work done after it.

Session state machine

                        create
                          │
                     ┌────▼────┐
                     │  idle   │◀──── run completes / resume with no pending work
                     └────┬────┘
                          │ operator posts message
                          │
                 ┌────────┴────────┐
                 │                 │
        under concurrency     at concurrency
              limit               limit
                 │                 │
            ┌────▼────┐      ┌────▼────┐
            │ running │      │ queued  │
            └──┬───┬──┘      └────┬────┘
               │   │              │ operator resumes
               │   │              │ (slot available)
               │   │         ┌────▼────┐
               │   │         │ running │
               │   │         └─────────┘
               │   │
 run completes │   │ checkpoint (operator or model)
               │   │
          ┌────▼┐ ┌▼─────┐
          │idle │ │paused│
          └─────┘ └──┬───┘
                     │ resume / rollback+resume
                ┌────▼────┐
                │ running │
                └─────────┘

    Any state except ended ──── operator ends session ──── ┌───────┐
                                                           │ ended │
                                                           └───────┘

    Any state except ended ──── unrecoverable error ─────── ┌────────┐
                                                            │ failed │
                                                            └────────┘

States

State | Description | Operator can… | Work happening?
idle | Session exists, no active work. Awaiting operator input. | Post a message, end the session, view history. | No.
running | Active run in progress. Primary and/or subagents are working. | Watch output, request a checkpoint, cancel the run, end the session. | Yes.
queued | Message accepted but run not started: the concurrency limit is reached. The message is persisted; the run is pending. | Resume (when a slot frees), discard the queued message, end the session. | No.
paused | At a checkpoint. Work is suspended. | Inspect state, edit prompts, resume, rollback, end the session. | No (subagents suspended or exited).
ended | Terminal. Session is complete. No further transitions. | View history. | No.
failed | Terminal. Session hit an unrecoverable error. | View history, view the failure details. | No.

Transitions

From | To | Trigger | Guard
(none) | idle | POST /api/v1/projects/:id/sessions | Project state is ready. No concurrency limit at creation (see § Concurrency).
idle | running | POST /api/v1/sessions/:id/messages | No active run; concurrency limit not reached.
idle | queued | POST /api/v1/sessions/:id/messages | No active run; concurrency limit reached (per-project or per-operator). Message persisted, run created in pending.
queued | running | POST /api/v1/sessions/:id/resume | Concurrency slot available; pending run is started.
queued | idle | DELETE /api/v1/sessions/:id/queued-message | Operator discards the queued message. Pending run cancelled, message marked superseded.
running | idle | Run completes (success or failure of the run itself). | (none)
running | paused | Operator requests checkpoint or model requests checkpoint. | At least one active run.
paused | running | POST /api/v1/sessions/:id/checkpoints/:cid/resume | Checkpoint is the current pause point.
paused | running | POST /api/v1/sessions/:id/checkpoints/:cid/rollback + new message | Rollback succeeds; operator posts a new message.
paused | idle | POST /api/v1/sessions/:id/checkpoints/:cid/resume with no pending work | Resume determines nothing is left to do.
any except ended | ended | DELETE /api/v1/sessions/:id?confirm=true | (none)
any except ended | failed | Unrecoverable internal error (storage failure, daemon bug). | (none)
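
The transition table can be encoded as data so that illegal transitions are rejected in one place. The following is a minimal sketch under assumptions: the Trigger names are hypothetical labels for the rows above, and guards (such as slot availability) are left to the caller. It follows the table literally, so a failed session can still be ended.

```typescript
type SessionState = "idle" | "running" | "queued" | "paused" | "ended" | "failed";

// Hypothetical trigger labels, one per row of the transition table.
type Trigger =
  | "message_under_limit" | "message_at_limit"
  | "resume_queued" | "discard_queued"
  | "run_completed" | "checkpoint_requested"
  | "checkpoint_resume" | "checkpoint_resume_no_work" | "rollback_then_message"
  | "end_session" | "unrecoverable_error";

const TRANSITIONS: Record<SessionState, Partial<Record<Trigger, SessionState>>> = {
  idle: { message_under_limit: "running", message_at_limit: "queued" },
  queued: { resume_queued: "running", discard_queued: "idle" },
  running: { run_completed: "idle", checkpoint_requested: "paused" },
  paused: {
    checkpoint_resume: "running",
    checkpoint_resume_no_work: "idle",
    rollback_then_message: "running",
  },
  ended: {},
  failed: {},
};

// Returns the next state, or null when the table has no matching row.
function nextState(from: SessionState, trigger: Trigger): SessionState | null {
  // The two "any except ended" rows apply before the per-state lookup.
  if (from !== "ended") {
    if (trigger === "end_session") return "ended";
    if (trigger === "unrecoverable_error") return "failed";
  }
  return TRANSITIONS[from][trigger] ?? null;
}
```

Returning null rather than throwing lets an API layer decide how to report an invalid request.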

Transition side effects

Transition | Side effects
idle (from running) | Persist run result. Flush WebSocket buffers. Emit session.state event. Broadcast sessions.running_count on system socket.
idle (from queued) | Cancel pending run (mark cancelled). Mark queued message superseded. Emit session.state event.
running (from idle) | Create a new Run record. Start primary agent processing. Emit session.state event. Broadcast sessions.running_count on system socket.
running (from queued) | Start the pending run (transition run pending → running). Start primary agent processing. Emit session.state event. Broadcast sessions.running_count on system socket.
queued | Persist operator message. Create Run record in pending state. Emit session.state event with queued: true and reason: "concurrency_limit".
paused | Suspend in-flight subagents (send SIGTSTP to caged processes). Capture checkpoint snapshot. Emit session.state event.
ended | Kill all live subagents (SIGTERM → SIGKILL). Persist final state. Close WebSocket with closing { reason: "session_ended" }. Emit session.state event. Broadcast sessions.running_count on system socket (if was running).
failed | Kill all live subagents. Persist failure details. Close WebSocket. Emit session.state event. Broadcast sessions.running_count on system socket (if was running).

Run lifecycle

Run states

    create (message posted)
         │
    ┌────▼─────┐
    │ pending  │──── primary picks it up ────▶┌──────────┐
    └──────────┘                              │ running  │
                                              └──┬──┬──┬─┘
                                                 │  │  │
                              completes normally │  │  │ operator cancels
                                                 │  │  │
                                           ┌─────▼┐ │ ┌▼──────────┐
                                           │done  │ │ │ cancelled │
                                           └──────┘ │ └───────────┘
                                                     │
                                              error  │
                                                     │
                                              ┌──────▼─┐
                                              │ failed │
                                              └────────┘

State | Description
pending | Run created, message queued. Primary has not started processing.
running | Primary is actively processing. Subagents may be spawned.
done | Run completed successfully. Output is available.
failed | Run encountered an error (LLM provider unreachable, subagent cage failure, etc.).
cancelled | Operator cancelled the run mid-flight. In-flight subagents killed.

Run record

Each run persists:

interface Run {
  id: string;                    // ULID
  session_id: string;
  created_at: number;            // epoch ms
  completed_at: number | null;
  state: "pending" | "running" | "done" | "failed" | "cancelled";

  // Input
  operator_message: Message;     // the message that triggered this run

  // Output
  primary_response: Message | null;
  subagent_invocations: SubagentInvocation[];
  tool_calls: ToolCall[];

  // Failure
  error: RunError | null;        // present when state is "failed"

  // Timing
  duration_ms: number | null;
  tokens_in: number | null;
  tokens_out: number | null;
}

Cancellation

When the operator cancels a run (POST /api/v1/sessions/:id/runs/:rid/cancel):

  1. The session manager marks the run cancelled.
  2. It sends SIGTERM to all subagents spawned by this run.
  3. It waits 5s for the subagents to exit.
  4. It sends SIGKILL to any survivors.
  5. The cages are cleaned up (per sandbox.md teardown).
  6. The primary's in-flight LLM request is aborted (the HTTP connection to the provider is closed).
  7. The session transitions from running → idle.
  8. The partial output from the run is preserved (marked as cancelled, not deleted).
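
The signal escalation in steps 2 to 4 can be modelled without real processes. The sketch below is a simulation only: SimulatedSubagent and its exitDelayMs field are hypothetical stand-ins for how long a caged process takes to exit after SIGTERM, and no signals are actually sent.

```typescript
type Signal = "SIGTERM" | "SIGKILL";

// Hypothetical model of a subagent: exitDelayMs is how long it takes to exit
// after SIGTERM, or null if it never exits on its own.
interface SimulatedSubagent {
  name: string;
  exitDelayMs: number | null;
}

interface CancelOutcome {
  runState: "cancelled";
  sessionState: "idle";
  signals: Array<{ target: string; signal: Signal }>;
}

function simulateCancel(subagents: SimulatedSubagent[], graceMs = 5_000): CancelOutcome {
  const signals: CancelOutcome["signals"] = [];
  // Step 2: SIGTERM goes to every subagent spawned by the run.
  for (const s of subagents) signals.push({ target: s.name, signal: "SIGTERM" });
  // Steps 3-4: after the grace period, only survivors receive SIGKILL.
  for (const s of subagents) {
    const exitedInTime = s.exitDelayMs !== null && s.exitDelayMs <= graceMs;
    if (!exitedInTime) signals.push({ target: s.name, signal: "SIGKILL" });
  }
  // Steps 1 and 7: the run ends cancelled and the session returns to idle.
  return { runState: "cancelled", sessionState: "idle", signals };
}
```

A real implementation would be asynchronous and would also cover cage teardown and the provider request abort, which this model omits.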

PTY broker

The PTY broker multiplexes terminal sessions over WebSocket. When a subagent runs a command that needs a terminal (shell, TUI, interactive tool), the broker allocates a PTY, connects it to the subagent's process inside the cage, and streams I/O to the operator's browser.

Architecture

operator's browser
      │
      │ WebSocket (pty channel)
      │
┌─────▼──────┐
│ PTY broker │──── manages ────▶ PTY instances (one per subagent terminal)
└─────┬──────┘
      │ pty fd
      │
┌─────▼──────────┐
│ subagent cage  │
│  └── process   │
│      (shell)   │
└────────────────┘

PTY allocation

PTYs are allocated when:

  1. A subagent's DSL configuration includes terminal: true (explicit).
  2. The primary dispatches a subagent with a tool call that requests terminal access.
  3. The operator explicitly requests a terminal for a running subagent via the UI.

PTY allocation uses node-pty (or Bun's native PTY API if/when available). The master side stays with the broker; the slave side is passed into the cage via file descriptor inheritance.

PTY lifecycle

allocate ──▶ active ──▶ detached ──▶ reattached ──▶ active
                │                                      │
                │                                      │
                ▼                                      ▼
             closed (subagent exits)              closed

State | Description
active | Operator is connected; I/O flows both ways.
detached | Operator disconnected; PTY is alive; output buffered (scrollback).
closed | Subagent exited or session ended. PTY fd closed. Transcript persisted.

I/O flow

Operator → PTY (input):

  1. WebSocket pty channel receives input frame: { type: "stdin", pty_id: "...", data: "<base64>" }.
  2. Broker writes decoded bytes to the PTY master fd.
  3. The shell (inside the cage) reads from the slave fd.

PTY → Operator (output):

  1. Broker reads from the PTY master fd.
  2. Frames output: { type: "stdout", pty_id: "...", data: "<base64>", seq: N }.
  3. Sends over the WebSocket pty channel.
  4. If the operator is disconnected (detached), output is buffered in the scrollback ring.
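
The frame shapes above can be sketched as follows. This is a minimal sketch under assumptions: OutputFramer and decodeStdin are hypothetical names, seq is assumed to start at 1 and increase by one per frame, and base64 handling uses the global btoa/atob available in Bun, Node, and browsers.

```typescript
interface StdinFrame { type: "stdin"; pty_id: string; data: string }
interface StdoutFrame { type: "stdout"; pty_id: string; data: string; seq: number }

// Raw bytes -> base64, going through a binary string because btoa works on strings.
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

function fromBase64(data: string): Uint8Array {
  const binary = atob(data);
  const out = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) out[i] = binary.charCodeAt(i);
  return out;
}

// Frames PTY output with a per-PTY sequence number (assumed to start at 1).
class OutputFramer {
  private seq = 0;
  constructor(private readonly ptyId: string) {}

  frame(chunk: Uint8Array): StdoutFrame {
    this.seq += 1;
    return { type: "stdout", pty_id: this.ptyId, data: toBase64(chunk), seq: this.seq };
  }
}

// Decodes an input frame into the bytes the broker would write to the master fd.
function decodeStdin(frame: StdinFrame): Uint8Array {
  return fromBase64(frame.data);
}
```

Base64 keeps arbitrary terminal bytes (escape sequences, partial UTF-8) intact inside a JSON frame.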

Resize

The operator's browser sends resize events:

{ "type": "resize", "pty_id": "...", "cols": 120, "rows": 40 }

The broker calls pty.resize(cols, rows), which updates the PTY's window size; the kernel then delivers SIGWINCH to the foreground process group inside the cage.

Scrollback and transcript

Each PTY maintains:

  • Scrollback ring buffer: configurable, default 10,000 lines. Stored in memory while the PTY is active. On PTY close, the full scrollback is persisted to the database as the run's terminal transcript.
  • Transcript persistence: the daemon stores the transcript as a blob in the run_pty_transcripts table, keyed by (run_id, pty_id). The transcript includes raw bytes (for faithful replay) and a timestamp per chunk (for timed replay).

On reattach, the operator receives the scrollback from the buffer (not from the database — the database is for post-session replay).
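
A fixed-capacity ring buffer is one way to hold the scrollback. The sketch below is illustrative only: ScrollbackRing is a hypothetical name, and it stores lines as strings, whereas the persisted transcript described above keeps raw bytes with timestamps.

```typescript
class ScrollbackRing {
  private readonly lines: string[] = [];
  private start = 0; // index of the oldest retained line
  private count = 0; // number of lines currently retained

  constructor(private readonly capacity = 10_000) {
    if (capacity <= 0) throw new Error("capacity must be positive");
  }

  // Appends a line, overwriting the oldest one once the ring is full.
  push(line: string): void {
    if (this.count < this.capacity) {
      this.lines[(this.start + this.count) % this.capacity] = line;
      this.count += 1;
    } else {
      this.lines[this.start] = line;
      this.start = (this.start + 1) % this.capacity;
    }
  }

  // Returns retained lines, oldest first (what a reattach burst would send).
  snapshot(): string[] {
    const out: string[] = [];
    for (let i = 0; i < this.count; i++) {
      out.push(this.lines[(this.start + i) % this.capacity]);
    }
    return out;
  }
}
```

Because old lines are overwritten in place, memory stays bounded however long the operator is detached.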

PTY limits

Limit | Default | Enforced by
Max concurrent PTYs per session | 8 | Session manager (matches subagent limit per daemon.md)
Scrollback ring size | 10,000 lines | PTY broker
Max PTY output rate | 1 MiB/sec to WebSocket | PTY broker (drops frames with pty.output_throttled warning)
PTY idle timeout | 30 minutes (no I/O) | PTY broker (sends notification to operator; does not auto-close)

Checkpoint lifecycle

Checkpoint creation

Operator-initiated:

  1. Operator clicks ⏸ or sends POST /api/v1/sessions/:id/checkpoints.
  2. Session manager transitions session to paused.
  3. Suspends in-flight subagents (SIGTSTP to caged processes; they can be resumed).
  4. Captures the checkpoint snapshot (see below).
  5. Returns the checkpoint ID to the operator.

Model-initiated:

  1. The primary (or a subagent) calls the checkpoint tool: { "tool": "checkpoint", "reason": "Need clarification on deployment target" }.
  2. Session manager transitions session to paused.
  3. Same suspension and snapshot as operator-initiated.
  4. The checkpoint includes the model's stated reason, surfaced in the UI.

Compaction-initiated (reason: compaction_pending), per ADR-0024:

  1. The harness compaction pipeline runs the configured strategy. When strategy: checkpoint, the harness creates a checkpoint with reason: "compaction_pending" and ends the current run with finishReason: "awaiting_compaction".
  2. Session manager transitions session to paused.
  3. The checkpoint snapshot is captured normally plus a sibling CompactionRecord (see § CompactionRecord below) describing the proposed compaction.
  4. The Compactor UI (per ui/compactor.md) inspects the proposed compaction; the operator edits/approves/rejects.
  5. On approve, the operator-approved compaction is applied and the session resumes via the normal checkpoint resume path.
  6. On reject, the checkpoint is rolled back and the session continues without compaction (the operator must then either manually compact with a different strategy or manage the context-overflow themselves).

The reason: compaction_pending is a distinguished value; the UI surfaces it differently from operator/model-initiated checkpoints. The API endpoint for inspecting a compaction-pending checkpoint composes the checkpoint + the proposed CompactionRecord in one response (per http-api.md § Compaction).

Checkpoint snapshot

A checkpoint captures:

interface Checkpoint {
  id: string;                     // ULID
  session_id: string;
  run_id: string;                 // the run that was paused
  created_at: number;             // epoch ms
  created_by: "operator" | "model" | "compaction";   // ADR-0024 adds "compaction"
  reason: string | null;          // model's stated reason, operator's label, or "compaction_pending"
  compaction_id: string | null;   // when created_by == "compaction", references CompactionRecord

  // State snapshot
  message_history_cursor: string; // pointer to the last message at pause time
  active_subagents: SubagentSnapshot[];
  pending_tool_calls: ToolCall[]; // tool calls that were in-flight
  prompts_in_effect: PromptSnapshot[];

  // Metadata
  resumed_at: number | null;
  rolled_back: boolean;
  superseded_by: string | null;   // another checkpoint that replaced this one
}

interface SubagentSnapshot {
  name: string;
  state: "suspended" | "exited";  // suspended = SIGTSTP'd; exited = finished before pause
  cage_id: string | null;
  last_output: string;            // last 500 chars of output, for UI preview
  pty_id: string | null;
}

interface PromptSnapshot {
  target: "primary" | string;     // "primary" or subagent name
  content_hash: string;           // hash of the prompt content at pause time
  editable: boolean;              // always true in v0
}

Checkpoint inspection

At a checkpoint, the operator can:

  1. View the message history up to the pause point.
  2. View active subagent state — which subagents were running, their last output, their cage status.
  3. View pending tool calls — what the model was about to do.
  4. Edit system prompts — change the primary's or any subagent's system prompt. Edits are captured in the audit log.
  5. View the model's reason (if model-initiated) — "I need clarification on..."

Resume

POST /api/v1/sessions/:id/checkpoints/:cid/resume

  1. Validate the checkpoint is the current pause point (not superseded).
  2. Apply any prompt edits the operator made.
  3. Resume suspended subagents (SIGCONT to caged processes).
  4. Transition session to running.
  5. The primary continues from where it paused, with the (possibly edited) prompts.

If the operator edited prompts, the primary receives a system-level message: "System prompts were updated at this checkpoint. Previous prompt hash: X, new prompt hash: Y." This lets the model adjust its approach if the operator changed direction.

Rollback

POST /api/v1/sessions/:id/checkpoints/:cid/rollback

  1. Validate the checkpoint exists in this session.
  2. Kill all subagents spawned after the checkpoint (they're working with post-checkpoint context).
  3. Mark all messages after the checkpoint as superseded: true (they remain in history for auditability, but the UI greys them out).
  4. Transition session to idle (the operator can now post a new message that starts a fresh run from the checkpoint's state).
  5. The next run uses the checkpoint's prompt snapshots as its starting prompts.

Rollback does not delete data. The full history is preserved. Rollback is a pointer operation: it changes where the "current state" points to.
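
The pointer-style rollback of step 3 can be sketched as a pure function over the message list. This sketch assumes that message_history_cursor holds a message ID and that messages are ordered oldest first; MessageRow and rollbackTo are hypothetical names.

```typescript
interface MessageRow {
  id: string;
  superseded: boolean;
}

// Marks every message after the cursor as superseded. Nothing is deleted and
// the input array is not mutated.
function rollbackTo(messages: MessageRow[], cursorId: string): MessageRow[] {
  const cursorIndex = messages.findIndex((m) => m.id === cursorId);
  if (cursorIndex === -1) {
    throw new Error(`cursor ${cursorId} not found in this session's history`);
  }
  return messages.map((m, i) => (i > cursorIndex ? { ...m, superseded: true } : m));
}
```

The returned list has the same length as the input, which reflects that rollback preserves the full history for audit.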

Checkpoint limits

Limit | Default | Rationale
Max checkpoints per session | 50 | Prevent unbounded storage growth. Oldest auto-archived (queryable, not shown in active list).
Checkpoint snapshot size | ~100 KB typical | Message cursor is a pointer, not a copy. Subagent snapshots are summaries.

Compaction lifecycle

Per ADR-0024, context compaction is owned by the harness (per agent.md § Compaction) and operates on the message-reconstruction path. The session manager's role is to persist CompactionRecord rows, transition the session state during strategy: checkpoint events, and surface compaction events to the operator.

CompactionRecord

A CompactionRecord is written for every compaction event (whether triggered automatically by threshold-crossing, by reactive-fallback after a context-length error, or by manual operator trigger).

interface CompactionRecord {
  id: string;                              // ULID
  session_id: string;
  run_id: string;                          // the run whose pre-call check triggered, or the run created by operator_manual
  agent_path: string;                      // which agent's window (per ADR-0022 canonical path)
  created_at: number;                      // epoch ms

  // Trigger and strategy
  trigger: "threshold_crossed" | "provider_overflow_retry" | "operator_manual" | "scheduled";
  strategy: "drop" | "summarize" | "delegate" | "checkpoint";

  // Estimates
  threshold_estimate: number;              // fraction of context window at trigger time (e.g. 0.91)
  after_estimate: number;                  // fraction after compaction (e.g. 0.55)
  window_upper: number;                    // configured upper threshold (e.g. 0.85)
  window_lower: number;                    // configured lower threshold (e.g. 0.60)

  // Message effect
  superseded_message_ids: string[];        // marked superseded by this event
  summary_message_id: string | null;       // new MessageRecord, if summarize/delegate produced one
  summary: string | null;                  // human-readable summary (audit log + UI)

  // Plugin participation
  plugins_fired: Array<{
    name: string;
    role: "observer" | "compactor";
    duration_ms: number;
    result_kind: "inject" | "retain" | "compactor_result" | "null" | "error";
  }>;

  // Cost
  plugin_cost: {
    provider: string;
    model: string;
    input_tokens: number;
    output_tokens: number;
    cost_usd: number;
  } | null;                                 // null when no model invoked (drop, observer-only delegate result)

  // Failure tracking
  fallback_occurred: boolean;              // true if delegate/summarize fell back to drop
  fallback_reason: string | null;          // human-readable reason

  // Operator feedback (per ADR-0024)
  operator_flag: "good" | "bad" | "neutral" | null;
  operator_notes: string | null;

  // Linkage
  checkpoint_id: string | null;            // when strategy == "checkpoint", links the paired Checkpoint
}

Storage

CompactionRecord rows live in the compactions table in kaged's SQLite (per ADR-0005) alongside checkpoints, messages, and sessions. Schema migration is in @kaged/storage; the migration adds the compactions table with indexes on session_id, run_id, and agent_path.

Lifecycle integration

  • Most strategies (drop / summarize / delegate) — the harness writes the CompactionRecord after the strategy completes and before the next LLM call. The session state does NOT change; the run continues with the compacted message list.
  • strategy: checkpoint — the harness writes the CompactionRecord with a placeholder summary (the proposed change), creates a paired Checkpoint with reason: "compaction_pending" and compaction_id: <CompactionRecord.id>, and ends the current run. The session transitions to paused per the checkpoint lifecycle. On checkpoint resume:
    • Approve → the operator-approved compaction is applied (the CompactionRecord is finalized with the actual superseded_message_ids and summary), the checkpoint marks resumed_at, the session transitions to running, and a new run is started with the compacted list.
    • Reject → the CompactionRecord is marked fallback_occurred: true, fallback_reason: "operator_rejected", the checkpoint is rolled back (existing rollback semantics), and the session transitions to idle. No messages are superseded. The operator must now either manually compact with a different strategy (per http-api.md) or manage the context-overflow themselves.

Operator feedback

The operator can attach a flag (good / bad / neutral) and free-text notes to any CompactionRecord after the fact via PATCH /api/v1/sessions/:id/compactions/:cid (per http-api.md). This persists on the CompactionRecord and surfaces in the Compactor UI's history view. Plugin authors iterating on a compactor plugin use these flags as a feedback signal.

Interaction with rollback

If the operator rolls back past a compaction point (rollback target is a checkpoint that predates the compaction):

  1. The standard rollback path runs (per § Rollback).
  2. Messages marked superseded = true by the compaction (its superseded_message_ids) are unsuperseded — they return to active state in the message list.
  3. Any summary_message_id created by the compaction is superseded by the rollback (it was a post-compaction artifact, not present at the rollback target).
  4. The CompactionRecord itself is NOT deleted — it remains in the audit log marked rolled_back: true (a new field on CompactionRecord).

This is the same machinery as message rollback (per ADR-0022 and § Rollback): compaction reuses the existing superseded flag on messages and needs no new message-level flag.
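
Steps 2 to 4 can be sketched as a pure function. This is a minimal sketch under assumptions: Msg, CompactionEffect, and undoCompaction are hypothetical names, and the ordering relative to the standard rollback in step 1 is not specified above, so the sketch covers only the compaction-specific part.

```typescript
interface Msg {
  id: string;
  superseded: boolean;
}

// The subset of CompactionRecord fields that rollback touches.
interface CompactionEffect {
  superseded_message_ids: string[];
  summary_message_id: string | null;
  rolled_back: boolean;
}

function undoCompaction(
  messages: Msg[],
  compaction: CompactionEffect,
): { messages: Msg[]; compaction: CompactionEffect } {
  const restore = new Set(compaction.superseded_message_ids);
  const updated = messages.map((m) => {
    // Step 2: messages the compaction superseded become active again.
    if (restore.has(m.id)) return { ...m, superseded: false };
    // Step 3: the summary message is a post-compaction artifact.
    if (m.id === compaction.summary_message_id) return { ...m, superseded: true };
    return m;
  });
  // Step 4: the record is kept and flagged, never deleted.
  return { messages: updated, compaction: { ...compaction, rolled_back: true } };
}
```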

Compaction limits

Limit | Default | Rationale
Max CompactionRecord rows per session | 1000 | Long sessions can accrue many compactions. When the limit is hit, the oldest 100 rows are aggregated into a single "compactions before T" summary row and then purged.
Compaction event timeout | 60s | A strategy that takes longer than 60s is treated as failed and falls back to drop.

Reattach semantics

Sessions survive operator disconnects. The reattach flow is designed for the common case: "operator's phone dropped Wi-Fi mid-task."

Disconnect detection

  1. WebSocket close frame received — clean disconnect. Immediate.
  2. WebSocket ping/pong timeout — unclean disconnect (network loss). Detected within 30s (ping interval is 15s, 2 missed pongs = disconnect).
  3. HTTP idle timeout — operator hasn't made any request in 10 minutes. Not a disconnect per se; the session remains attached but the buffer deadline starts.
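
The ping/pong rule in item 2 can be sketched as a small counter. This is an illustrative sketch, not the daemon's code: Heartbeat and its methods are hypothetical, and onTick is assumed to be called by a timer once per ping interval.

```typescript
const PING_INTERVAL_MS = 15_000;
const MAX_MISSED_PONGS = 2;

class Heartbeat {
  private awaitingPong = false;
  private missed = 0;

  // Called when a pong arrives from the operator's browser.
  onPong(): void {
    this.awaitingPong = false;
    this.missed = 0;
  }

  // Called once per PING_INTERVAL_MS. Returns what the caller should do next.
  onTick(): "ping" | "disconnect" {
    // An unanswered ping from the previous tick counts as one missed pong.
    if (this.awaitingPong) this.missed += 1;
    if (this.missed >= MAX_MISSED_PONGS) return "disconnect";
    this.awaitingPong = true;
    return "ping";
  }
}

// Worst-case detection time after the first unanswered ping: 2 * 15s = 30s.
const DETECTION_WINDOW_MS = PING_INTERVAL_MS * MAX_MISSED_PONGS;
```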

During disconnect

The daemon:

  1. Keeps the session and all its in-flight work alive. Work does not stop because the operator disconnected.
  2. Buffers WebSocket frames that would have been sent to the operator:
    • Output channel: buffered up to 10 minutes or 50 MiB (whichever comes first).
    • Events channel: buffered with the same limits.
    • PTY channel: NOT buffered for replay (PTY scrollback is in the PTY ring buffer; see PTY broker).
    • Control channel: not buffered (control messages are request/response, not streaming).
  3. Records the disconnect in the audit log: session.detached { session_id, reason, user_id }.
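
The buffer limits in item 2 can be sketched as a queue with two eviction conditions. Several points here are assumptions rather than statements from this spec: the 10-minute limit is read as a per-frame age, the oldest frames are evicted first when either limit is exceeded, and DisconnectBuffer is a hypothetical name.

```typescript
interface BufferedFrame {
  seq: number;
  sizeBytes: number;
  bufferedAtMs: number;
}

class DisconnectBuffer {
  private frames: BufferedFrame[] = [];
  private totalBytes = 0;

  constructor(
    private readonly maxAgeMs = 10 * 60 * 1000,
    private readonly maxBytes = 50 * 1024 * 1024,
  ) {}

  push(seq: number, sizeBytes: number, nowMs: number): void {
    this.frames.push({ seq, sizeBytes, bufferedAtMs: nowMs });
    this.totalBytes += sizeBytes;
    this.evict(nowMs);
  }

  // Drops the oldest frames while either limit is exceeded.
  private evict(nowMs: number): void {
    while (this.frames.length > 0) {
      const oldest = this.frames[0];
      const tooOld = nowMs - oldest.bufferedAtMs > this.maxAgeMs;
      const tooBig = this.totalBytes > this.maxBytes;
      if (!tooOld && !tooBig) break;
      this.frames.shift();
      this.totalBytes -= oldest.sizeBytes;
    }
  }

  // Oldest sequence number still replayable, or null when empty.
  oldestSeq(): number | null {
    return this.frames.length > 0 ? this.frames[0].seq : null;
  }

  size(): number {
    return this.totalBytes;
  }
}
```

Once frames have been evicted, a client asking to resume from before oldestSeq() hits a buffer gap.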

Reattach

  1. Operator opens a new WebSocket to /api/v1/sessions/:id/socket.
  2. Sends control { type: "hello", payload: { resume_from_seq: { output: N, events: M } } }.
  3. Daemon checks the buffer:
    • Buffer has the requested sequences: replays missed frames from N+1 onward for output, M+1 for events. Sends welcome { session_id, server_seq: { ... } } first.
    • Buffer gap (too old or overflowed): sends closing { code: "resume_failed" }. Client must do a full re-fetch via GET /api/v1/sessions/:id/messages?since=<last-seen-message-id>.
  4. PTY reattach: the broker sends the current scrollback content for any active PTYs as a burst of pty.stdout frames. This is the PTY ring buffer content, not a replay of every byte since disconnect.
  5. Session is now attached again. Normal streaming resumes.
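
The buffer check in step 3 can be sketched per channel as below. This is a minimal sketch under assumptions: planReplay is a hypothetical name, the buffer is assumed to hold contiguous ascending sequence numbers ending at the server's latest, and a client claiming a sequence ahead of the server is treated as a failed resume (a case the steps above do not cover).

```typescript
interface ReplayFrame {
  seq: number;
  payload: string;
}

type ReattachResult =
  | { kind: "replay"; frames: ReplayFrame[] }
  | { kind: "resume_failed" };

// lastSeenSeq is the client's resume_from_seq value for one channel;
// serverSeq is the latest sequence the daemon has produced for that channel.
function planReplay(buffer: ReplayFrame[], lastSeenSeq: number, serverSeq: number): ReattachResult {
  if (lastSeenSeq > serverSeq) return { kind: "resume_failed" }; // assumption, see lead-in
  if (lastSeenSeq === serverSeq) return { kind: "replay", frames: [] }; // nothing missed
  // The first frame the client needs is lastSeenSeq + 1; it must still be buffered.
  const oldestBuffered = buffer.length > 0 ? buffer[0].seq : serverSeq + 1;
  if (oldestBuffered > lastSeenSeq + 1) return { kind: "resume_failed" };
  return { kind: "replay", frames: buffer.filter((f) => f.seq > lastSeenSeq) };
}
```

On resume_failed the client falls back to the full re-fetch described in step 3.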

Multi-device

v0 allows one operator connection per session at a time (per http-api.md). A second WebSocket upgrade attempt while attached returns 409 conflict.

To move between devices:

  1. Close the WebSocket on the old device (or wait for idle timeout).
  2. Open on the new device.

v1 may add read-only observer connections (second device can watch but not interact). Not v0.


Persistence

What is persisted and when

Data | Persisted when | Storage
Session record (id, project, state, created_by, timestamps) | On creation, on every state transition | sessions table
Messages (operator and model) | Immediately on receipt/generation | messages table
Run records | On creation, on completion/failure/cancel | runs table
Subagent invocations | On spawn, on exit | subagent_invocations table
Tool calls and results | On call, on result | tool_calls table
Checkpoints | On creation | checkpoints table
PTY transcripts | On PTY close | run_pty_transcripts table
Prompt edits (at checkpoints) | On edit | prompt_edits table

Persistence guarantees

  • No data loss on clean shutdown. The daemon flushes all pending writes before exiting.
  • Minimal data loss on SIGKILL. SQLite WAL mode ensures committed transactions survive. The only loss is the current in-flight LLM response (which is streamed and not yet committed). On next startup, the session manager:
    1. Scans for sessions in running state.
    2. Checks if the corresponding run has a committed response.
    3. If not, marks the run as failed with error: "daemon_crash_during_run".
    4. Transitions the session to idle.
    5. Logs session.crash_recovered in the audit log.
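
The startup scan above can be sketched over in-memory rows. This is a minimal sketch, not the daemon's recovery code: has_committed_response is a hypothetical field standing in for the step 2 check, and marking a run with a committed response as done is an assumption, since the steps only say what happens when the response is missing.

```typescript
type RunState = "pending" | "running" | "done" | "failed" | "cancelled";

interface SessionRow {
  id: string;
  state: "idle" | "running" | "queued" | "paused" | "ended" | "failed";
}

interface RunRow {
  id: string;
  session_id: string;
  state: RunState;
  has_committed_response: boolean; // hypothetical: stands in for the step 2 check
  error: string | null;
}

interface AuditEvent {
  event: "session.crash_recovered";
  session_id: string;
}

function recoverAfterCrash(sessions: SessionRow[], runs: RunRow[]) {
  const audit: AuditEvent[] = [];
  const recovered = new Set<string>();

  // Steps 1, 4, 5: every session left in running goes back to idle and is logged.
  const nextSessions: SessionRow[] = sessions.map((s) => {
    if (s.state !== "running") return s;
    recovered.add(s.id);
    audit.push({ event: "session.crash_recovered", session_id: s.id });
    return { ...s, state: "idle" };
  });

  // Steps 2, 3: the interrupted run fails unless its response was committed.
  const nextRuns: RunRow[] = runs.map((r) => {
    if (!recovered.has(r.session_id) || r.state !== "running") return r;
    return r.has_committed_response
      ? { ...r, state: "done" } // assumption, see lead-in
      : { ...r, state: "failed", error: "daemon_crash_during_run" };
  });

  return { sessions: nextSessions, runs: nextRuns, audit };
}
```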

Schema sketch

CREATE TABLE sessions (
  id            TEXT PRIMARY KEY,   -- ULID
  project_id    TEXT NOT NULL,
  created_by    TEXT NOT NULL,      -- user_id
  state         TEXT NOT NULL,      -- idle, running, queued, paused, ended, failed
  created_at    INTEGER NOT NULL,   -- epoch ms
  updated_at    INTEGER NOT NULL,
  ended_at      INTEGER,
  forked_from   TEXT,               -- session ID if this was forked
  model_override TEXT,              -- "provider:model" — overrides DSL primary.model for this session
  bound_issue   TEXT,               -- ADR-0034: FK to issues.id — session's bound issue (nullable)
  FOREIGN KEY (project_id) REFERENCES projects(id)
);

CREATE TABLE messages (
  id            TEXT PRIMARY KEY,   -- ULID
  session_id    TEXT NOT NULL,
  run_id        TEXT,               -- null for the initial operator message before a run starts
  role          TEXT NOT NULL,      -- operator, primary, subagent, system
  content       TEXT NOT NULL,
  created_at    INTEGER NOT NULL,
  superseded    INTEGER DEFAULT 0,  -- 1 if rolled back past
  metadata      TEXT,               -- JSON: token counts, model used, etc.
  FOREIGN KEY (session_id) REFERENCES sessions(id)
);

CREATE TABLE runs (
  id            TEXT PRIMARY KEY,   -- ULID
  session_id    TEXT NOT NULL,
  state         TEXT NOT NULL,      -- pending, running, done, failed, cancelled
  created_at    INTEGER NOT NULL,
  completed_at  INTEGER,
  duration_ms   INTEGER,
  tokens_in     INTEGER,
  tokens_out    INTEGER,
  error         TEXT,               -- JSON: RunError, null if no error
  FOREIGN KEY (session_id) REFERENCES sessions(id)
);

CREATE TABLE subagent_invocations (
  id            TEXT PRIMARY KEY,   -- ULID
  run_id        TEXT NOT NULL,
  session_id    TEXT NOT NULL,
  subagent_name TEXT NOT NULL,
  cage_id       TEXT,               -- null if uncaged
  state         TEXT NOT NULL,      -- spawned, running, done, failed, killed
  spawned_at    INTEGER NOT NULL,
  exited_at     INTEGER,
  exit_code     INTEGER,
  FOREIGN KEY (run_id) REFERENCES runs(id)
);

CREATE TABLE tool_calls (
  id            TEXT PRIMARY KEY,   -- ULID
  run_id        TEXT NOT NULL,
  session_id    TEXT NOT NULL,
  caller        TEXT NOT NULL,      -- "primary" or subagent name
  tool_name     TEXT NOT NULL,
  input         TEXT NOT NULL,      -- JSON
  output        TEXT,               -- JSON, null if pending
  state         TEXT NOT NULL,      -- pending, done, failed
  created_at    INTEGER NOT NULL,
  completed_at  INTEGER,
  FOREIGN KEY (run_id) REFERENCES runs(id)
);

CREATE TABLE checkpoints (
  id            TEXT PRIMARY KEY,   -- ULID
  session_id    TEXT NOT NULL,
  run_id        TEXT NOT NULL,
  created_at    INTEGER NOT NULL,
  created_by    TEXT NOT NULL,      -- "operator", "model", or "compaction" (ADR-0024)
  reason        TEXT,
  snapshot      TEXT NOT NULL,       -- JSON: CheckpointSnapshot
  resumed_at    INTEGER,
  rolled_back   INTEGER DEFAULT 0,
  superseded_by TEXT,
  FOREIGN KEY (session_id) REFERENCES sessions(id)
);

CREATE TABLE run_pty_transcripts (
  id            TEXT PRIMARY KEY,
  run_id        TEXT NOT NULL,
  pty_id        TEXT NOT NULL,
  transcript    BLOB NOT NULL,      -- raw bytes with timing markers
  line_count    INTEGER NOT NULL,
  created_at    INTEGER NOT NULL,
  FOREIGN KEY (run_id) REFERENCES runs(id)
);

CREATE TABLE prompt_edits (
  id            TEXT PRIMARY KEY,   -- ULID
  checkpoint_id TEXT NOT NULL,
  session_id    TEXT NOT NULL,
  target        TEXT NOT NULL,      -- "primary" or subagent name
  old_hash      TEXT NOT NULL,
  new_hash      TEXT NOT NULL,
  new_content   TEXT NOT NULL,      -- the edited prompt text
  edited_at     INTEGER NOT NULL,
  FOREIGN KEY (checkpoint_id) REFERENCES checkpoints(id)
);

-- Indexes
CREATE INDEX idx_messages_session ON messages(session_id, created_at);
CREATE INDEX idx_runs_session ON runs(session_id, created_at);
CREATE INDEX idx_subagent_invocations_run ON subagent_invocations(run_id);
CREATE INDEX idx_tool_calls_run ON tool_calls(run_id);
CREATE INDEX idx_checkpoints_session ON checkpoints(session_id, created_at);

All tables use TEXT primary keys (ULIDs). Timestamps are integer epoch milliseconds (per ADR-0005 implementation notes). JSON columns use TEXT with application-level validation.


Session forking

An operator can create a new session forked from an existing one:

POST /api/v1/projects/music/sessions
{
  "resume_from": "01HXAB..."
}

Forking:

  1. Creates a new session with forked_from = <source_session_id>.
  2. Copies the message history from the source session up to the current point (or up to a specified checkpoint).
  3. Does NOT share state — the fork is a snapshot copy. Changes to the original don't affect the fork.
  4. The new session starts in idle state.

Forking is a relatively expensive operation (message copy). It is not a branching model like git — there's no merge. Forks are independent sessions that happen to share an origin.
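
A minimal sketch of the snapshot copy described above, using in-memory rows. The forkSession name, the row shapes, the counter-based newId helper (a stand-in for a ULID generator), and the use of a timestamp cutoff to represent "up to a specified checkpoint" are all assumptions made for illustration.

```typescript
// Hypothetical row shapes; the real tables carry more columns.
interface SessionRow { id: string; projectId: string; state: "idle" | "running"; forkedFrom: string | null; }
interface MessageRow { id: string; sessionId: string; content: string; createdAt: number; }
interface ForkResult { session: SessionRow; messages: MessageRow[]; }

let nextId = 0;
// Stand-in for a ULID generator, so the sketch has no dependencies.
const newId = (prefix: string): string => `${prefix}-${++nextId}`;

function forkSession(
  source: SessionRow,
  history: MessageRow[],
  upToMs: number = Number.POSITIVE_INFINITY, // assumed cutoff: a checkpoint's created_at
): ForkResult {
  // Steps 1 and 4: a new idle session that records its origin.
  const session: SessionRow = {
    id: newId("session"),
    projectId: source.projectId,
    state: "idle",
    forkedFrom: source.id,
  };
  // Steps 2 and 3: copy each message into a fresh row, so later edits to
  // the source rows cannot leak into the fork.
  const messages: MessageRow[] = history
    .filter((m) => m.sessionId === source.id && m.createdAt <= upToMs)
    .map((m) => ({ ...m, id: newId("message"), sessionId: session.id }));
  return { session, messages };
}
```

The copy is linear in the number of messages up to the cutoff, which is where the cost of forking comes from.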


Session-issue binding (ADR-0034)

Per ADR-0034, a session binds to at most one issue. The agent's working checklist is stored as issue-owned todos, mutated through the root-only kaged.todo tool that operates implicitly on the bound issue.

Binding semantics

  • Session-side pointer. bound_issue is a nullable column on the sessions table. It points at an issues.id row. An issue may be worked by many sessions over its lifetime, one active at a time.
  • Binding is not a state transition. Binding and unbinding from the UI/sidebar attaches or detaches the pointer; it does not mutate, clear, or complete stored todos. Unbinding simply puts the existing (possibly incomplete) todos out of reach of the current session. Re-binding the same issue rehydrates them.
  • kaged.todo requires a binding. With no binding, the tool resolves to an error that tells the agent to ask the operator to bind an issue. This is deliberate: the issue is the storage, so no issue means no todos.
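
The binding semantics above can be illustrated with a small in-memory model. This is a sketch under stated assumptions: the todosByIssue map stands in for issue-owned storage, and the bind, unbind, and todoTool functions and the error text are hypothetical, not the daemon's API.

```typescript
interface Todo { text: string; done: boolean; }
interface Session { id: string; boundIssue: string | null; }
type TodoResult = { ok: true; todos: Todo[] } | { ok: false; error: string };

// Todos live with the issue, not the session (hypothetical storage).
const todosByIssue = new Map<string, Todo[]>();

function bind(session: Session, issueId: string): void {
  // The bind endpoint validates that the issue exists.
  if (!todosByIssue.has(issueId)) throw new Error(`issue not found: ${issueId}`);
  session.boundIssue = issueId; // pointer change only; todos are untouched
}

function unbind(session: Session): void {
  session.boundIssue = null; // stored todos are neither cleared nor completed
}

// Stand-in for the kaged.todo tool: it operates implicitly on the bound issue.
function todoTool(session: Session): TodoResult {
  if (session.boundIssue === null) {
    return { ok: false, error: "No issue bound. Ask the operator to bind an issue." };
  }
  return { ok: true, todos: todosByIssue.get(session.boundIssue) ?? [] };
}
```

Because the session holds only a pointer, re-binding the same issue returns the same todo list that was visible before the unbind.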

Bind/unbind endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/v1/sessions/:id/bind | PUT | Bind an issue to the session. Body: { "issue_id": "<ulid>" }. Validates the issue exists. |
| /api/v1/sessions/:id/bind | DELETE | Unbind the issue from the session. Sets bound_issue to null. |

Session creation with binding

POST /api/v1/projects/:id/sessions accepts an optional bound_issue field in the body. When present, the session is created with the binding already set. This is the "send to agent" path from the issue detail screen.

Auto-conclude answer

Per ADR-0034, ending a session never silently resolves its bound issue. The operator confirms closure at the closing checkpoint. When the agent has marked the last open criterion done, it calls kaged.checkpoint with a reason like "acceptance criteria met, requesting sign-off." The session pauses. The operator reviews, then resumes; the resume path transitions the issue to resolved (via kaged.issue transition or the UI) and the session may conclude.

This resolves the standing open question: manual session termination does not auto-resolve an assigned issue. The operator confirms, at the closing checkpoint.


Concurrency

Session-level concurrency

| Resource | Limit | Enforced when | Enforced by |
| --- | --- | --- | --- |
| Max concurrent running sessions per project | 4 | Message send (post_message) | Session manager |
| Max concurrent running sessions per operator (across projects) | 16 | Message send (post_message) | Session manager |
| Max concurrent runs per session | 1 | Message send (post_message) | Session manager (v0; sequential runs only) |
| Max concurrent subagents per run | 8 | Subagent dispatch | Subagent supervisor (per daemon.md) |
| Max concurrent PTYs per session | 8 | PTY allocation | PTY broker |

Concurrency is checked at message-send time, not session-creation time. Session creation always succeeds (as long as the project is ready). The running-session count is checked when the operator posts a message:

  • Under the limit: the run starts immediately. Session transitions idle → running.
  • At the limit: the message is persisted, a run is created in pending, and the session transitions idle → queued. The operator sees a pause indicator and can resume when a slot frees.

No auto-dequeue. When a running session ends and frees a slot, the daemon broadcasts the updated count via the system WebSocket (sessions.running_count) but does not automatically start queued sessions. The operator explicitly resumes each queued session via POST /api/v1/sessions/:id/resume. This is deliberate: the operator decides which queued session gets the freed slot, not the daemon.

What counts as "running": only sessions in the running state. Sessions in idle, queued, paused, ended, or failed do not count against the concurrency limit. A paused session has suspended its run (checkpoint) and released the slot; a queued session never started its run.

Exceeding a per-session limit (one run per session — posting while running) returns 409 conflict. Exceeding a cross-session concurrency limit (per-project or per-operator) does not return an error — the message is queued.
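
The throttle rules in this section can be sketched as a pair of pure functions over in-memory sessions. The limits (4 and 16) and the 409 on resume without a free slot come from this spec; the function names, the Outcome shape, and the decision to reject every non-idle state in postMessage are simplifying assumptions (the states and transitions tables define the real guards).

```typescript
type SessionState = "idle" | "running" | "queued" | "paused" | "ended" | "failed";
type LimitReason = "per_project" | "per_operator";

interface Session { id: string; projectId: string; operatorId: string; state: SessionState; }

type Outcome =
  | { kind: "started" }
  | { kind: "queued"; reason: LimitReason }
  | { kind: "conflict"; httpStatus: 409 };

const LIMITS = { perProject: 4, perOperator: 16 };

// Returns the limit (if any) that blocks `target` from taking a slot.
// Only sessions in the running state are counted.
function blockedBy(all: Session[], target: Session): LimitReason | null {
  const running = all.filter((s) => s.state === "running");
  const inProject = running.filter((s) => s.projectId === target.projectId).length;
  const byOperator = running.filter((s) => s.operatorId === target.operatorId).length;
  if (inProject >= LIMITS.perProject) return "per_project";
  if (byOperator >= LIMITS.perOperator) return "per_operator";
  return null;
}

function postMessage(all: Session[], target: Session): Outcome {
  // Per-session limit: posting while running is a conflict, not a queue.
  // Simplification: every other non-idle state is rejected here as well.
  if (target.state !== "idle") return { kind: "conflict", httpStatus: 409 };
  const reason = blockedBy(all, target);
  if (reason === null) {
    target.state = "running"; // idle → running
    return { kind: "started" };
  }
  target.state = "queued"; // idle → queued; the message and pending run are kept
  return { kind: "queued", reason };
}

// No auto-dequeue: the operator calls this explicitly once a slot frees.
function resumeQueued(all: Session[], target: Session): Outcome {
  if (target.state !== "queued") return { kind: "conflict", httpStatus: 409 };
  if (blockedBy(all, target) !== null) return { kind: "conflict", httpStatus: 409 };
  target.state = "running"; // queued → running
  return { kind: "started" };
}
```

With four running sessions in one project, postMessage on a fifth idle session yields a queued outcome with reason per_project, and resumeQueued succeeds only after one of the four leaves the running state.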

Locking

  • Session lock: each session has an in-memory mutex. All state transitions acquire the lock. This prevents races between "operator cancels run" and "run completes naturally" arriving simultaneously.
  • No cross-session locks. Sessions are independent. A deadlock between sessions is architecturally impossible.
  • Database writes are serialized by SQLite's write lock (or Postgres' transaction isolation). The session manager does not add its own database-level locking beyond the engine's guarantees.
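
One common way to build a per-session in-memory mutex in a JavaScript runtime is to chain promises, so that each critical section starts only after the previous one settles. The sketch below assumes that approach; the SessionLock class, the transition helper, and the run object are hypothetical, and the spec does not prescribe this implementation.

```typescript
// A promise-chain mutex: callers queue behind the tail of the chain.
class SessionLock {
  private tail: Promise<void> = Promise.resolve();

  run<T>(fn: () => Promise<T> | T): Promise<T> {
    const result = this.tail.then(fn);
    // Keep the chain alive even if fn throws, so later callers still run.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

type RunState = "running" | "done" | "cancelled";

const lock = new SessionLock();
const currentRun = { state: "running" as RunState };

// Both "operator cancels run" and "run completes naturally" go through
// the same guarded transition; only the first to acquire the lock wins.
function transition(to: "done" | "cancelled"): Promise<boolean> {
  return lock.run(() => {
    if (currentRun.state !== "running") return false; // guard: already finished
    currentRun.state = to;
    return true;
  });
}
```

If transition("cancelled") and transition("done") are issued back to back, the first resolves to true and the second to false, which the caller can surface as a conflict.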

DSL hot-reload interaction

Per daemon.md:

  • DSL changes are applied at next session-start, not to active sessions.
  • Active sessions continue with the DSL they were started under.
  • The UI shows a "DSL changed; new sessions will use the updated config" indicator.
  • The operator can use "restart with new DSL" in the session UI, which ends the current session and creates a new one with the updated DSL.

The session manager stores the DSL version (a hash of the project.yaml content) at session creation time. This is used to:

  1. Detect stale sessions (DSL changed since session started).
  2. Replay/audit: know which DSL was in effect during a session.
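
A minimal sketch of the version check, assuming SHA-256 over the UTF-8 bytes of project.yaml. The spec only says "a hash", so the algorithm, the dslVersion and isStale names, and the SessionDslInfo shape are assumptions.

```typescript
import { createHash } from "node:crypto";

interface SessionDslInfo { sessionId: string; dslVersion: string; }

// Hash of the project.yaml content (algorithm is an assumption).
function dslVersion(projectYaml: string): string {
  return createHash("sha256").update(projectYaml, "utf8").digest("hex");
}

// A session is stale when the DSL on disk no longer matches the version
// recorded at session creation time.
function isStale(session: SessionDslInfo, currentYaml: string): boolean {
  return session.dslVersion !== dslVersion(currentYaml);
}
```

Storing the hash rather than the full file keeps the session row small, but replaying a session exactly would also need the DSL content itself to be retrievable by that hash.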

Audit events

| Event | When | Data |
| --- | --- | --- |
| session.created | New session | session_id, project_id, user_id, forked_from |
| session.queued | Message accepted but throttled by concurrency limit | session_id, run_id, message_id, reason (per_project or per_operator), running_count, limit |
| session.resumed_from_queue | Operator resumed a queued session | session_id, run_id, user_id, running_count |
| session.queued_discarded | Operator discarded a queued message | session_id, run_id, message_id, user_id |
| session.state | State transition | session_id, from_state, to_state, trigger |
| session.attached | Operator connected via WebSocket | session_id, user_id, device_hint |
| session.detached | Operator disconnected | session_id, user_id, reason (clean/timeout/error) |
| session.ended | Session ended | session_id, user_id, run_count, duration |
| session.crash_recovered | Daemon restarted; session was in running | session_id, failed_run_id |
| run.created | New run started | run_id, session_id, message_preview |
| run.completed | Run finished | run_id, state, duration_ms, tokens |
| run.cancelled | Operator cancelled a run | run_id, session_id, user_id |
| checkpoint.created | Checkpoint taken | checkpoint_id, session_id, created_by, reason |
| checkpoint.resumed | Checkpoint resumed | checkpoint_id, session_id, prompts_edited |
| checkpoint.rolled_back | Rollback to checkpoint | checkpoint_id, session_id, messages_superseded |
| prompt.edited | Prompt changed at checkpoint | checkpoint_id, target, old_hash, new_hash |
| pty.allocated | PTY created for a subagent | pty_id, session_id, subagent_name |
| pty.closed | PTY closed | pty_id, reason, transcript_lines |
| pty.reattached | PTY reconnected after disconnect | pty_id, scrollback_lines_sent |

Failure modes

| Failure | Detection | Recovery | Operator impact |
| --- | --- | --- | --- |
| LLM provider unreachable during run | HTTP timeout / connection refused | Run marked failed. Session → idle. Operator retries. | Message "Provider unreachable. Try again or check provider config." |
| Subagent cage fails to spawn | Sandbox compiler error or bwrap exec failure | Run marked failed with cage error details. Session → idle. | Error details shown; operator checks DSL cage config. |
| Subagent exceeds walltime | Cage walltime watchdog | Subagent killed. Run continues with partial output if other subagents are active; otherwise run marked failed. | Warning: "Subagent X exceeded walltime limit." |
| Subagent OOM | cgroup OOM killer | Same as walltime. | Warning: "Subagent X exceeded memory limit." |
| Database write failure | SQLite/Postgres error | Session transitions to failed. Daemon logs critical error. | Session lost. Operator creates a new session. |
| Daemon SIGKILL during run | Next startup scan | Run marked failed. Session → idle. | "Previous run was interrupted by a daemon restart." |
| WebSocket disconnect mid-run | Ping/pong timeout | Work continues. Frames buffered. Operator reattaches. | Seamless if reattach within buffer window; full re-fetch otherwise. |
| Checkpoint resume with edited prompts fails | Primary rejects the prompt | Run starts but immediately fails. Session → idle. | "Prompt edit caused an error. Review the edited prompt." |
| Fork source session not found | Database lookup miss | Fork request returns 404. | "Source session not found." |

Testing notes

State machine tests

  • Every transition: assert the guard condition, the side effects, and the resulting state.
  • Invalid transitions: assert rejection (e.g., ended → running returns error).
  • Concurrent transitions: two concurrent callers try to transition the same session simultaneously. Assert one wins, one gets a conflict.
  • Crash recovery: simulate daemon kill during running. Assert next startup marks run as failed and session as idle.

Concurrency throttle tests

  • Under limit: post message with running count below limit. Assert session → running, run starts.
  • At per-project limit: create 4 running sessions in a project, post a message on a 5th. Assert session → queued, run is pending, message is persisted.
  • At per-operator limit: create 16 running sessions across projects, post a message on a 17th. Assert session → queued.
  • Resume from queued: session is queued, slot frees, operator calls resume. Assert session → running, pending run starts.
  • Resume when still at limit: session is queued, no slot freed. Operator calls resume. Assert 409 conflict (slot not available).
  • Discard queued message: session is queued, operator discards. Assert session → idle, run marked cancelled, message marked superseded.
  • Paused session frees slot: session A is running, session B is queued. Checkpoint session A. Assert session A → paused, running count decrements. Operator can now resume session B.
  • Session creation ignores concurrency: create 5 sessions in a project (limit is 4). Assert all 5 succeed with idle state. Concurrency is not checked at creation.
  • System count broadcast: start/end a run. Assert sessions.running_count event fires on the system socket with correct counts.

PTY tests

  • Allocation: request a PTY for a subagent. Assert the PTY is created and I/O flows.
  • Resize: send resize event. Assert SIGWINCH reaches the process.
  • Detach/reattach: disconnect the WebSocket. Assert PTY stays alive. Reconnect. Assert scrollback is delivered.
  • Transcript persistence: close a PTY. Assert the transcript is in the database. Fetch via API. Assert content matches.
  • Rate limiting: flood the PTY output channel. Assert frames are dropped with a warning, not that the daemon OOMs.

Checkpoint tests

  • Create (operator): request checkpoint during a run. Assert session pauses, subagents suspended.
  • Create (model): mock a primary that calls the checkpoint tool. Assert same behavior.
  • Resume: resume a checkpoint. Assert subagents get SIGCONT. Assert session → running.
  • Resume with prompt edit: edit a prompt at a checkpoint, resume. Assert the primary sees the edit notification.
  • Rollback: rollback to a checkpoint. Assert later messages are marked superseded. Assert next run starts from the checkpoint state.
  • Rollback to old checkpoint: rollback to a checkpoint that is not the most recent. Assert intermediate checkpoints are superseded.

Persistence tests

  • Message ordering: post 100 messages. Assert they come back in order.
  • Run recording: complete a run. Assert all fields are persisted.
  • Session survives restart: create a session, restart the daemon. Assert session is still there with correct state.
  • Fork: fork a session. Assert message history is copied. Modify original. Assert fork is unaffected.

Reattach tests

  • Clean reattach within buffer window: disconnect, reconnect within 10 minutes. Assert missed frames are replayed.
  • Reattach after buffer overflow: disconnect, wait for buffer to overflow. Assert resume_failed and client falls back to HTTP re-fetch.
  • Multi-device conflict: connect from device A, try to connect from device B. Assert 409.

Open questions

  1. Parallel runs. v0 is one run per session. Should v0.x allow queuing multiple messages (the primary processes them sequentially) or true parallel runs (multiple primaries)? Sequential queuing seems low-cost; true parallel is a big complexity bump.
  2. Session sharing between operators. v0 is single-operator sessions. v1 may want shared sessions (one operator watches another's session in read-only). Needs auth model extension.
  3. Session archival. Old sessions accumulate. When do they get archived? v0: never automatically. Operator uses DELETE to end them. A future kaged session archive --older-than 30d is plausible.
  4. Subagent suspension fidelity. SIGTSTP to a caged process works for simple commands but may not work for all programs (some ignore TSTP, some corrupt state). v0 documents the limitation; v0.x may add a "snapshot and kill" checkpoint mode for subagents that can't be suspended.
  5. PTY replay fidelity. Timed replay of PTY transcripts is useful for debugging but the timing markers add storage overhead. v0 stores timestamps per chunk (every ~100ms batch); finer granularity deferred.
  6. Session memory limit. A session with thousands of messages may exhaust the context window of the primary model. The session manager does not manage context windows (that's the primary's job), but should it enforce a message-count limit? v0: no limit. The primary is responsible for its own context management.

Amendments

2026-05-23 — Per-session model override

Added model_override column to the sessions table and the corresponding modelOverride field to SessionRecord. When set, it is a "provider:model" string that overrides the DSL's primary.model alias for all runs in this session. The daemon's dispatch path checks session.modelOverride before alias resolution — if set, it splits the override into provider + model, resolves the provider's credentials from local config, and constructs the ProviderRoute directly, bypassing alias lookup entirely.

The override is a session-level persistent setting, not per-message. It is set via PUT /api/v1/sessions/:id (see http-api.md) and can be cleared by setting it to null. The UI exposes a model picker in the session input area that reads from the operator's configured providers and their persisted model catalogs.

POST /api/v1/sessions/:id/messages also accepts an optional model_override field in the request body. When present, it is persisted to the session record before dispatch, becoming the session's override for this and all subsequent messages (until changed or cleared). This enables per-message model switching while maintaining the "session-level override" semantic — the last-used model is sticky.
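
The override handling described in this amendment can be sketched as below. The splitting rule (split on the first colon, so a model name may itself contain colons) is an assumption, as are the function names and the trimmed-down SessionRecord and request-body shapes; aliasRoute stands in for whatever alias resolution would have produced.

```typescript
interface ProviderModel { provider: string; model: string; }
interface SessionRecord { id: string; modelOverride: string | null; }
interface PostMessageBody { content: string; model_override?: string; }

// Splits "provider:model" on the first colon (assumed rule).
function parseModelOverride(override: string): ProviderModel {
  const idx = override.indexOf(":");
  if (idx <= 0 || idx === override.length - 1) {
    throw new Error(`invalid model override: ${override}`);
  }
  return { provider: override.slice(0, idx), model: override.slice(idx + 1) };
}

// A model_override on a message is persisted to the session before
// dispatch, which is what makes the last-used model sticky.
function applyMessageOverride(session: SessionRecord, body: PostMessageBody): void {
  if (body.model_override !== undefined) {
    parseModelOverride(body.model_override); // validate before persisting
    session.modelOverride = body.model_override;
  }
}

// The override is checked before alias resolution and bypasses it when set.
function modelForRun(session: SessionRecord, aliasRoute: ProviderModel): ProviderModel {
  return session.modelOverride !== null
    ? parseModelOverride(session.modelOverride)
    : aliasRoute;
}
```

A message without model_override leaves the session's existing override in place, so a later run still uses the model chosen on an earlier message.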

2026-05-27 — ADR-0024: compaction-pending checkpoints and CompactionRecord

Per ADR-0024:

  1. Checkpoint created_by extended. New value "compaction" added alongside "operator" and "model". It is set by the harness's compaction pipeline when strategy: checkpoint is configured. New compaction_id field on Checkpoint links to the paired CompactionRecord.
  2. reason: "compaction_pending" is a distinguished value indicating the checkpoint was created by a strategy: checkpoint compaction event awaiting operator review. The Compactor UI (per ui/compactor.md) surfaces these checkpoints differently.
  3. New CompactionRecord shape defined. Persisted in the compactions SQLite table. Fields include trigger, strategy, threshold estimates, message effect (superseded IDs, summary message), plugin participation, cost, fallback tracking, operator feedback (operator_flag, operator_notes), and checkpoint_id linkage. Schema migration in @kaged/storage adds the table with indexes on session_id, run_id, agent_path.
  4. Compaction lifecycle section added documenting how compaction integrates with the session manager: most strategies don't change session state, but strategy: checkpoint creates a paired checkpoint and pauses the session. Approve / reject flows specified.
  5. Operator feedback persistence. PATCH /api/v1/sessions/:id/compactions/:cid (per http-api.md) writes operator_flag and operator_notes to the CompactionRecord. Used by the Compactor UI's history view.
  6. Rollback interaction. Rolling back past a compaction un-supersedes the messages listed in the affected superseded_message_ids. The summary message (if any) is superseded by the rollback. The CompactionRecord is preserved with rolled_back: true (a new field).
  7. Compaction limits added — 1000 CompactionRecord rows per session before aggregation; 60s timeout per compaction event with fallback to drop.
  8. Constrained-by list extended with ADR-0024.

2026-06-05 — ADR-0034: session-issue binding, bind/unbind transitions, auto-conclude answer

Per ADR-0034:

  1. bound_issue column added to the sessions table schema. Nullable FK to issues.id. Stored on SessionRecord as boundIssue: string | null.
  2. Session-issue binding section added. Documents binding semantics: session-side pointer, not a state transition; binding/unbinding does not mutate todos; kaged.todo requires a binding.
  3. Bind/unbind endpoints documented. PUT /api/v1/sessions/:id/bind and DELETE /api/v1/sessions/:id/bind.
  4. Session creation with binding. POST /api/v1/projects/:id/sessions accepts optional bound_issue field.
  5. Auto-conclude answer. Ending a session never silently resolves its bound issue. The operator confirms at the closing checkpoint. Resolves the standing open question.
  6. Constrained-by list extended with ADR-0034.

2026-06-05 — Concurrency throttle at message-send: queued state, system-wide running count, operator-explicit resume

The concurrency model is restructured: limits are checked at message-send time, not session-creation time, and exceeded messages queue rather than reject.

  1. New session state: queued. When the operator posts a message and the running-session count (per-project or per-operator) is at the limit, the session enters queued instead of rejecting the message. The message is persisted, a run is created in pending, and no primary dispatch occurs.
  2. Concurrency limits moved from creation to message-send. checkSessionCreationLimits is removed from session creation. Session creation always succeeds (project must be ready). The per-project (4) and per-operator (16) limits are checked in post_message instead.
  3. Operator-explicit resume. The daemon does not auto-dequeue queued sessions when slots free. The operator calls POST /api/v1/sessions/:id/resume to start the pending run. This gives the operator control over which queued session gets a freed slot.
  4. Queued message discard. The operator can discard a queued message via DELETE /api/v1/sessions/:id/queued-message, which cancels the pending run, marks the message superseded, and transitions the session back to idle.
  5. System-wide running count broadcast. Every run start/end broadcasts sessions.running_count on the system WebSocket (per http-api.md § System WebSocket). The count includes per-project and per-operator breakdowns so the UI can show both the global count and whether a specific session is near its limit.
  6. paused sessions release the slot. A checkpoint-paused session transitions running → paused and no longer counts against the concurrency limit. This is consistent with the existing checkpoint semantics (work is suspended) and means pausing a session can free a slot for a queued one.
  7. State machine diagram, states table, transitions table, transition side effects, and audit events updated. Three new audit events: session.queued, session.resumed_from_queue, session.queued_discarded.
  8. "What counts as running" clarified. Only running sessions count against the limit. idle, queued, paused, ended, and failed do not.

2026-06-05: Sticky todo reminder injection in primary runner

  • The primary runner's message pipeline now includes a sticky todo reminder injection point. After spend-limit enforcement and before the LLM call, the runner resolves the session's bound_issue, loads its todos, and conditionally appends an ephemeral system-role message listing open items. The message is not persisted, existing only for that single LLM call. Suppression logic prevents redundancy when a kaged.todo tool result is already in the preceding turn. See agent-tooling.md § kaged.todo for the full specification.
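
A rough sketch of the injection step, for orientation only. The authoritative behavior is in agent-tooling.md; the Turn shape (including a "tool" role and toolName field), the reminder wording, and the rule that an empty open list produces no reminder are assumptions made for this illustration.

```typescript
interface Todo { text: string; done: boolean; }
interface Turn {
  role: "operator" | "primary" | "system" | "tool"; // "tool" is an assumed role
  content: string;
  toolName?: string;
}

// Returns the message list to send for one LLM call. The input array is
// never mutated, so the reminder is never persisted with the history.
function withTodoReminder(messages: Turn[], boundIssueTodos: Todo[] | null): Turn[] {
  const open = (boundIssueTodos ?? []).filter((t) => !t.done);
  if (open.length === 0) return messages; // no binding or nothing open

  // Suppression: skip the reminder when the preceding turn is already a
  // kaged.todo tool result, since the model has just seen the list.
  const last = messages[messages.length - 1];
  if (last !== undefined && last.role === "tool" && last.toolName === "kaged.todo") {
    return messages;
  }

  const reminder: Turn = {
    role: "system",
    content: "Open todos:\n" + open.map((t) => `- ${t.text}`).join("\n"),
  };
  return [...messages, reminder];
}
```

Returning a new array instead of pushing onto the stored history is what keeps the reminder ephemeral in this sketch.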

References

  • ADR-0002 — web-first, sessions survive disconnects
  • ADR-0004 — Bun runtime, PTY integration
  • ADR-0005 — storage engine, schema portability
  • ADR-0007 — identity threading through sessions
  • ADR-0009 — cage mechanism for subagents within sessions
  • ADR-0010 — deployment modes, same session semantics
  • ADR-0024 — context compaction; CompactionRecord and compaction_pending checkpoint
  • agent.md § Compaction — the harness-side compaction pipeline this section integrates with
  • http-api.md — the API surface the session manager implements
  • daemon.md — process model, supervisor, operating limits
  • sandbox.md — cage spawn/teardown called by the session manager
  • plugin-host.md — plugins called during runs
  • project-dsl.md — DSL that defines session shape (subagents, cages, prompts)
  • local-config.md — alias resolution at session-start