Spec: Session Manager

Status: Draft
Last amended: 2026-07-01 (continue-suppression at tool boundaries for auto-continue)
Constrained by: ADR-0002, ADR-0004, ADR-0005, ADR-0007, ADR-0009, ADR-0010, ADR-0024, ADR-0034
Implements: packages/session-manager/ (planned)

Purpose

This spec defines the session manager: the daemon subsystem that owns the lifecycle of sessions — from creation through active work to checkpoint, detach, reattach, and termination. It is the binding layer between the operator's intent (via the HTTP/WS API) and the daemon's execution machinery (primary agent, subagent supervisor, plugin host).

This document is normative for:

The session state machine — every state, every transition, every guard.
The PTY broker — how terminal sessions are multiplexed, attached, detached, resized, and persisted.
The checkpoint lifecycle — creation, inspection, resume, rollback, and the state they capture.
The reattach semantics — what happens when the operator reconnects after a disconnect.
The run model — what a "run" is, how runs relate to sessions, cancellation, and failure.
The persistence contract — what is durably stored, when, and in what shape.
The concurrency model — how many sessions, subagents, and runs can be active simultaneously.

It is not normative for:

The HTTP/WS API shape (that's http-api.md — the session manager implements the behavior behind those endpoints).
The sandbox or cage mechanism (that's sandbox.md).
The primary agent's dispatch logic (the session manager starts runs; the primary decides what to do within a run).
The plugin host (that's plugin-host.md).
The project DSL format (that's project-dsl.md).
The daemon's process-level lifecycle (that's daemon.md).

Constraints (from ADRs)

Constraint	Source
Web-first; sessions survive operator disconnects	ADR-0002
Runtime is Bun; PTY via `node-pty` or Bun's native PTY (if available)	ADR-0004
Session state persisted to SQLite (or Postgres); durable across daemon restarts	ADR-0005
Auth identity (user_id) threads through sessions for audit and multi-operator	ADR-0007
Subagents spawned in cages; session manager delegates to sandbox subsystem	ADR-0009
Both deployment modes use the same session semantics	ADR-0010

Core concepts

Session

A session is a persistent, named context within a project. It is the unit of "what is currently happening" — an ongoing conversation between the operator and the primary agent, potentially with subagents running work in cages.

Key properties:

Project-scoped. Every session belongs to exactly one project.
Operator-scoped. Every session has a created_by user_id. In per-user mode, this is always the single operator. In system-wide mode, sessions are visible only to the operator who created them (v0; shared sessions are a v2 concern).
Survives disconnect. The operator closing their browser or losing network does not end the session. Work continues. The operator reattaches later.
Survives daemon restart. Session state is persisted continuously. A daemon SIGKILL loses only uncommitted in-flight reasoning (marked as a failed run on next startup).
Named with ULIDs. Session IDs are ULIDs (sortable, unique, no coordination needed). Displayed to the operator as the ULID string (e.g., 01HXAB...).

Run

A run is a single unit of model work within a session: the operator sends a message, the primary processes it (possibly dispatching subagents), and the work completes (or fails, or is cancelled). A session contains an ordered list of runs.

Key properties:

One active run per session at a time (v0). If a message arrives while a run is in progress, the API returns 409 conflict (the session is running — the conflict is per-session, not the concurrency throttle). The concurrency throttle operates across sessions: if the operator's or project's running-session count is at the limit, the message is accepted but the session enters queued with the run in pending.
Runs have their own state (pending, running, done, failed, cancelled). A run in pending means the message is saved but the primary has not started, either because the session just queued it (concurrency throttle) or because the daemon is about to pick it up (the normal brief gap between creation and dispatch).
Runs are the audit boundary. Every run records its input, output, subagent invocations, tool calls, and timing.
Runs can be cancelled. Cancellation kills in-flight subagents and returns control to the operator.

Checkpoint

A checkpoint is a first-class pause in a session. It captures a snapshot of the session's state at a moment in time, allowing the operator to inspect, edit prompts, and either resume or roll back.

Key properties:

Operator-initiated (the operator hits ⏸ in the UI) or model-initiated (the primary or a subagent requests one via an explicit tool call).
Inspectable. The checkpoint contains the message history up to that point, the active subagent states, the current tool calls, and the prompts in effect.
Editable. At a checkpoint, the operator can edit system prompts before resuming.
Reversible. The operator can roll back to a checkpoint, discarding work done after it.

Session state machine

                        create
                          │
                     ┌────▼────┐
                     │  idle   │◀──── run completes / resume with no pending work
                     └────┬────┘
                          │ operator posts message
                          │
                 ┌────────┴────────┐
                 │                 │
        under concurrency     at concurrency
              limit               limit
                 │                 │
            ┌────▼────┐      ┌────▼────┐
            │ running │      │ queued  │
            └──┬───┬──┘      └────┬────┘
               │   │              │ operator resumes
               │   │              │ (slot available)
               │   │         ┌────▼────┐
               │   │         │ running │
               │   │         └─────────┘
               │   │
 run completes │   │ checkpoint (operator or model)
               │   │
          ┌────▼┐ ┌▼────┐
          │idle │ │paused│
          └─────┘ └──┬──┘
                     │ resume / rollback+resume
                ┌────▼────┐
                │ running │
                └─────────┘

    Any state except ended ──── operator ends session ──── ┌───────┐
                                                           │ ended │
                                                           └───────┘

    Any state except ended ──── unrecoverable error ─────── ┌────────┐
                                                            │ failed │
                                                            └────────┘

States

State	Description	Operator can…	Work happening?
`idle`	Session exists, no active work. Awaiting operator input.	Post a message, end the session, view history.	No.
`running`	Active run in progress. Primary and/or subagents are working.	Watch output, request a checkpoint, cancel the run, end the session.	Yes.
`queued`	Message accepted but run not started — concurrency limit is reached. The message is persisted; the run is `pending`.	Resume (when a slot frees), discard the queued message, end the session.	No.
`paused`	At a checkpoint. Work is suspended.	Inspect state, edit prompts, resume, rollback, end the session.	No (subagents suspended or exited).
`ended`	Terminal. Session is complete. No further transitions.	View history.	No.
`failed`	Terminal. Session hit an unrecoverable error.	View history, view the failure details.	No.

Transitions

From	To	Trigger	Guard
(none)	`idle`	`POST /api/v1/projects/:slug/sessions`	Project state is `ready`. No concurrency limit at creation (see § Concurrency).
`idle`	`running`	`POST /api/v1/sessions/:id/messages`	No active run; concurrency limit not reached.
`idle`	`queued`	`POST /api/v1/sessions/:id/messages`	No active run; concurrency limit reached (per-project or per-operator). Message persisted, run created in `pending`.
`queued`	`running`	`POST /api/v1/sessions/:id/resume`	Concurrency slot available; pending run is started.
`queued`	`idle`	`DELETE /api/v1/sessions/:id/queued-message`	Operator discards the queued message. Pending run cancelled, message marked superseded.
`running`	`idle`	Run completes (success or failure of the run itself).	—
`running`	`paused`	Operator requests checkpoint or model requests checkpoint.	At least one active run.
`paused`	`running`	`POST /api/v1/sessions/:id/checkpoints/:cid/resume`	Checkpoint is the current pause point.
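`paused`	`running`	`POST /api/v1/sessions/:id/checkpoints/:cid/rollback` + new message	Rollback succeeds and returns the session to `idle` (see § Rollback); the operator then posts a new message, which follows the normal `idle` → `running` guard.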
`paused`	`idle`	`POST /api/v1/sessions/:id/checkpoints/:cid/resume` with no pending work	Resume determines nothing is left to do.
any except `ended`	`ended`	`DELETE /api/v1/sessions/:id?confirm=true`	—
any except `ended`	`failed`	Unrecoverable internal error (storage failure, daemon bug).	—
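
The transition table above can be read as a pure function from (state, trigger, guard inputs) to the next state. The following is a minimal sketch under that reading, not the daemon's implementation: the names SessionState, Trigger, GuardInputs, and nextState are illustrative, the trigger labels are shorthand for the endpoints in the table, and rollback is left out because it is a compound operation (see § Rollback). A null result means the table has no matching row.

```typescript
// Illustrative names only; none of these identifiers come from the spec.
type SessionState = "idle" | "running" | "queued" | "paused" | "ended" | "failed";

type Trigger =
  | "post_message"       // POST .../messages
  | "resume_queued"      // POST .../resume
  | "discard_queued"     // DELETE .../queued-message
  | "run_finished"       // run completes (success or failure of the run itself)
  | "checkpoint"         // operator or model requests a checkpoint
  | "resume_checkpoint"  // POST .../checkpoints/:cid/resume
  | "end_session"        // DELETE .../sessions/:id?confirm=true
  | "internal_error";    // unrecoverable internal error

interface GuardInputs {
  slotAvailable: boolean;  // concurrency limit not reached
  hasPendingWork: boolean; // only consulted when resuming a checkpoint
}

function nextState(state: SessionState, trigger: Trigger, g: GuardInputs): SessionState | null {
  // "any except ended" rows, followed literally from the table.
  if (state === "ended") return null;
  if (trigger === "end_session") return "ended";
  if (trigger === "internal_error") return "failed";

  if (state === "idle" && trigger === "post_message") {
    return g.slotAvailable ? "running" : "queued";
  }
  if (state === "queued" && trigger === "resume_queued") {
    return g.slotAvailable ? "running" : null;
  }
  if (state === "queued" && trigger === "discard_queued") return "idle";
  if (state === "running" && trigger === "run_finished") return "idle";
  if (state === "running" && trigger === "checkpoint") return "paused";
  if (state === "paused" && trigger === "resume_checkpoint") {
    return g.hasPendingWork ? "running" : "idle";
  }
  return null; // no matching row (for example, post_message while running)
}
```

Keeping the table as a pure function makes each row checkable in isolation; the side effects listed in the next section would be layered on top by the caller.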

Transition side effects

Transition	Side effects
→ `idle` (from `running`)	Persist run result. Flush WebSocket buffers. Emit `session.state` event. Broadcast `sessions.running_count` on system socket.
→ `idle` (from `queued`)	Cancel pending run (mark `cancelled`). Mark queued message `superseded`. Emit `session.state` event.
→ `running` (from `idle`)	Create a new Run record. Start primary agent processing. Emit `session.state` event. Broadcast `sessions.running_count` on system socket.
→ `running` (from `queued`)	Start the pending run (transition run `pending` → `running`). Start primary agent processing. Emit `session.state` event. Broadcast `sessions.running_count` on system socket.
→ `queued`	Persist operator message. Create Run record in `pending` state. Emit `session.state` event with `queued: true` and `reason: "concurrency_limit"`.
→ `paused`	Suspend in-flight subagents (send `SIGTSTP` to caged processes). Capture checkpoint snapshot. Emit `session.state` event.
→ `ended`	Kill all live subagents (`SIGTERM` → `SIGKILL`). Persist final state. Close WebSocket with `closing { reason: "session_ended" }`. Emit `session.state` event. Broadcast `sessions.running_count` on system socket (if was running).
→ `failed`	Kill all live subagents. Persist failure details. Close WebSocket. Emit `session.state` event. Broadcast `sessions.running_count` on system socket (if was running).

Run lifecycle

Run states

    create (message posted)
         │
    ┌────▼─────┐
    │ pending  │──── primary picks it up ────▶┌──────────┐
    └──────────┘                              │ running  │
                                              └──┬──┬──┬─┘
                                                 │  │  │
                              completes normally │  │  │ operator cancels
                                                 │  │  │
                                           ┌─────▼┐ │ ┌▼──────────┐
                                           │done  │ │ │ cancelled │
                                           └──────┘ │ └───────────┘
                                                     │
                                              error  │
                                                     │
                                              ┌──────▼─┐
                                              │ failed │
                                              └────────┘

State	Description
`pending`	Run created, message queued. Primary has not started processing.
`running`	Primary is actively processing. Subagents may be spawned.
`done`	Run completed successfully. Output is available.
`failed`	Run encountered an error (LLM provider unreachable, subagent cage failure, etc.).
`cancelled`	Operator cancelled the run mid-flight. In-flight subagents killed.
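
As a minimal sketch of the run states above, the allowed moves can be kept in a lookup table. The names RunState, RUN_TRANSITIONS, and canTransition are illustrative. The pending → cancelled entry is taken from the queued-message discard path in § Transition side effects rather than from the diagram, so treat it as this sketch's reading of the spec.

```typescript
// Illustrative names; the spec defines the states, not this table.
type RunState = "pending" | "running" | "done" | "failed" | "cancelled";

const RUN_TRANSITIONS: Record<RunState, readonly RunState[]> = {
  pending: ["running", "cancelled"], // cancelled: operator discards the queued message
  running: ["done", "failed", "cancelled"],
  done: [],      // terminal
  failed: [],    // terminal
  cancelled: [], // terminal
};

// True when the run state machine permits moving from `from` to `to`.
function canTransition(from: RunState, to: RunState): boolean {
  return RUN_TRANSITIONS[from].indexOf(to) !== -1;
}
```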

Run record

Each run persists:

interface Run {
  id: string;                    // ULID
  session_id: string;
  created_at: number;            // epoch ms
  completed_at: number | null;
  state: "pending" | "running" | "done" | "failed" | "cancelled";

  // Input
  operator_message: Message;     // the message that triggered this run

  // Output
  primary_response: Message | null;
  subagent_invocations: SubagentInvocation[];
  tool_calls: ToolCall[];

  // Failure
  error: RunError | null;        // present when state is "failed"

  // Timing
  duration_ms: number | null;
  tokens_in: number | null;
  tokens_out: number | null;
}

Cancellation

When the operator cancels a run (POST /api/v1/sessions/:id/runs/:rid/cancel):

The session manager marks the run cancelled.
Sends SIGTERM to all subagents spawned by this run.
Waits 5s for subagents to exit.
SIGKILL survivors.
Cage cleanup (per sandbox.md teardown).
The primary's in-flight LLM request is aborted (the HTTP connection to the provider is closed).
The session transitions from running → idle.
The partial output from the run is preserved (marked as cancelled, not deleted).
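
The ordering of the steps above can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: CancelDeps and every method on it are hypothetical hooks standing in for the run store, the sandbox subsystem, and the provider client, and waitForExit is modelled as a blocking call for brevity where a real daemon would await it.

```typescript
type TermSignal = "SIGTERM" | "SIGKILL";

// Hypothetical dependency surface; none of these names are defined by the spec.
interface CancelDeps {
  markRunCancelled(runId: string): void;
  sendSignal(pid: number, signal: TermSignal): void;
  // Waits up to timeoutMs and returns the pids that are still alive afterwards.
  waitForExit(pids: number[], timeoutMs: number): number[];
  cleanupCages(runId: string): void;         // per sandbox.md teardown
  abortProviderRequest(runId: string): void; // closes the HTTP connection to the provider
  setSessionIdle(sessionId: string): void;   // running -> idle
}

const GRACE_PERIOD_MS = 5000; // "Waits 5s for subagents to exit"

function cancelRun(sessionId: string, runId: string, subagentPids: number[], deps: CancelDeps): void {
  deps.markRunCancelled(runId);
  for (const pid of subagentPids) deps.sendSignal(pid, "SIGTERM");
  const survivors = deps.waitForExit(subagentPids, GRACE_PERIOD_MS);
  for (const pid of survivors) deps.sendSignal(pid, "SIGKILL");
  deps.cleanupCages(runId);
  deps.abortProviderRequest(runId);
  deps.setSessionIdle(sessionId);
  // Partial output is left in place; nothing is deleted on cancellation.
}
```

Injecting the dependencies keeps the ordering testable without spawning real processes.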

Regeneration

Regeneration re-runs the agent from an operator message. It backs the operator's "regenerate" affordance and the UI's transient-error retry loop (see ADR-0052 and http-api.md POST .../messages/:mid/regenerate).

Regeneration introduces no new session state and no new run event. It is a compound operation over primitives that already exist:

Anchor. The target is an existing operator message id (:mid). The operator row is preserved across regenerations, so its id is a stable anchor; the reply's run_id, by contrast, changes on every regeneration and is not a valid anchor.
Supersede-after. All messages created after :mid are marked superseded: true — the same logical-deletion primitive used by checkpoint rollback (see Rollback), here anchored on an operator message id rather than a checkpoint's message cursor. Rows are retained for auditability. :mid itself is not superseded or mutated (messages are immutable).
Dispatch. A new run is created and the primary is dispatched from :mid's content, using the same idle → running (post_message) transition and the same concurrency check as posting a message. If the per-project or per-operator running-session limit is hit, the session instead takes the idle → queued (queue) transition and the run waits, exactly as a posted message would.

Preconditions: the session must be idle (regeneration is rejected with 409 otherwise, mirroring post_message). Because the failed-run path already returns the session to idle, the common "regenerate after a transient failure" case satisfies this precondition without any extra step.
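
The supersede-after step can be illustrated with a small pure function. This is a sketch under assumptions: the Msg shape is a cut-down stand-in for the messages table, the list is assumed to be ordered by creation time, and supersedeAfter is a hypothetical name. Rejecting a non-operator anchor reflects the rule that :mid is an operator message id.

```typescript
// Cut-down message shape for illustration; the real row has more columns.
interface Msg {
  id: string;
  role: "operator" | "primary" | "subagent" | "system";
  superseded: boolean;
}

// Returns a copy of `messages` (assumed ordered oldest to newest) in which every
// message created after the anchor `mid` is marked superseded. The anchor itself
// and everything before it are left as they were.
function supersedeAfter(messages: readonly Msg[], mid: string): Msg[] {
  let anchor = -1;
  for (let i = 0; i < messages.length; i++) {
    if (messages[i].id === mid) {
      anchor = i;
      break;
    }
  }
  if (anchor === -1) throw new Error("unknown message id: " + mid);
  if (messages[anchor].role !== "operator") {
    throw new Error("regeneration anchor must be an operator message");
  }
  return messages.map((m, i) => (i > anchor ? { ...m, superseded: true } : { ...m }));
}
```

Dispatch (creating the new run and taking the idle → running or idle → queued transition) would follow this step and is not shown.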

Retry is two-tier (see ADR-0052). Tier 1 is the fast, automatic, in-run retry middleware inside the resolved model (see llm.md §Retry policy) — it retries the open provider call only, never after streaming has begun. Tier 2 is the operator-visible frontend loop that calls this regenerate operation on a backoff after a run fails; it arms only from the failed run's normalized retryable classification and respects any retry_after_until the provider advised. Regeneration itself performs no retry — it is the primitive both the operator's "regenerate" button and Tier 2 invoke. The daemon's single automatic context_overflow compaction retry (see Compaction lifecycle) is a distinct mechanism and is unaffected; Tier 2 must not arm on it.

PTY broker

The PTY broker multiplexes terminal sessions over WebSocket. When a subagent runs a command that needs a terminal (shell, TUI, interactive tool), the broker allocates a PTY, connects it to the subagent's process inside the cage, and streams I/O to the operator's browser.

Architecture

operator's browser
      │
      │ WebSocket (pty channel)
      │
┌─────▼──────┐
│ PTY broker │──── manages ────▶ PTY instances (one per subagent terminal)
└─────┬──────┘
      │ pty fd
      │
┌─────▼──────────┐
│ subagent cage  │
│  └── process   │
│      (shell)   │
└────────────────┘

PTY allocation

PTYs are allocated when:

A subagent's DSL configuration includes terminal: true (explicit).
The primary dispatches a subagent with a tool call that requests terminal access.
The operator explicitly requests a terminal for a running subagent via the UI.

PTY allocation uses node-pty (or Bun's native PTY API if/when available). The master side stays with the broker; the slave side is passed into the cage via file descriptor inheritance.

PTY lifecycle

allocate ──▶ active ──▶ detached ──▶ reattached ──▶ active
                │                                      │
                │                                      │
                ▼                                      ▼
             closed (subagent exits)              closed

State	Description
`active`	Operator is connected; I/O flows both ways.
`detached`	Operator disconnected; PTY is alive; output buffered (scrollback).
`closed`	Subagent exited or session ended. PTY fd closed. Transcript persisted.

I/O flow

Operator → PTY (input):

WebSocket pty channel receives input frame: { type: "stdin", pty_id: "...", data: "<base64>" }.
Broker writes decoded bytes to the PTY master fd.
The shell (inside the cage) reads from the slave fd.

PTY → Operator (output):

Broker reads from the PTY master fd.
Frames output: { type: "stdout", pty_id: "...", data: "<base64>", seq: N }.
Sends over the WebSocket pty channel.
If the operator is disconnected (detached), output is buffered in the scrollback ring.

Resize

The operator's browser sends resize events:

{ "type": "resize", "pty_id": "...", "cols": 120, "rows": 40 }

The broker calls pty.resize(cols, rows), which sends SIGWINCH to the process group inside the cage.

Scrollback and transcript

Each PTY maintains:

Scrollback ring buffer: configurable, default 10,000 lines. Stored in memory for as long as the PTY is alive (active or detached). On PTY close, the full scrollback is persisted to the database as the run's terminal transcript.
Transcript persistence: the daemon stores the transcript as a blob in the run_pty_transcripts table, keyed by (run_id, pty_id). The transcript includes raw bytes (for faithful replay) and a timestamp per chunk (for timed replay).

On reattach, the operator receives the scrollback from the buffer (not from the database — the database is for post-session replay).
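
A line-based ring buffer of the kind described above might look like the sketch below. It is an assumption-laden illustration: ScrollbackRing is a hypothetical name, lines are held as strings (the persisted transcript keeps raw bytes and per-chunk timestamps, which this sketch ignores), and splitting PTY output into lines is left to the caller.

```typescript
// Fixed-capacity ring of lines; once full, each push overwrites the oldest line.
class ScrollbackRing {
  private readonly capacity: number;
  private readonly slots: string[];
  private start = 0; // index of the oldest line
  private count = 0; // number of lines currently held

  constructor(capacity: number = 10000) {
    if (capacity <= 0) throw new Error("capacity must be positive");
    this.capacity = capacity;
    this.slots = new Array<string>(capacity);
  }

  push(line: string): void {
    if (this.count < this.capacity) {
      this.slots[(this.start + this.count) % this.capacity] = line;
      this.count++;
    } else {
      this.slots[this.start] = line; // overwrite the oldest line
      this.start = (this.start + 1) % this.capacity;
    }
  }

  // Lines from oldest to newest, as would be sent in the reattach burst.
  snapshot(): string[] {
    const out: string[] = [];
    for (let i = 0; i < this.count; i++) {
      out.push(this.slots[(this.start + i) % this.capacity]);
    }
    return out;
  }
}
```

With a capacity of 3, pushing "a", "b", "c", "d" leaves "b", "c", "d" in the snapshot: the ring keeps the most recent lines rather than every byte since disconnect.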

PTY limits

Limit	Default	Enforced by
Max concurrent PTYs per session	8	Session manager (matches subagent limit per `daemon.md`)
Scrollback ring size	10,000 lines	PTY broker
Max PTY output rate	1 MiB/sec to WebSocket	PTY broker (drops frames with `pty.output_throttled` warning)
PTY idle timeout	30 minutes (no I/O)	PTY broker (sends notification to operator; does not auto-close)
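
The output rate limit could be enforced in several ways; the table fixes only the default (1 MiB/sec) and the drop-with-warning behaviour. The sketch below assumes a fixed one-second window, which is one simple option and not something the spec prescribes. OutputRateLimiter and allow are hypothetical names, and the clock is passed in so the logic stays deterministic.

```typescript
const DEFAULT_LIMIT_BYTES_PER_SEC = 1024 * 1024; // 1 MiB

// Fixed-window limiter (an assumption of this sketch, not a spec requirement).
class OutputRateLimiter {
  private readonly limit: number;
  private windowStartMs = 0;
  private bytesInWindow = 0;

  constructor(limitBytesPerSec: number = DEFAULT_LIMIT_BYTES_PER_SEC) {
    this.limit = limitBytesPerSec;
  }

  // Returns true if the frame may be sent. On false the caller would drop the
  // frame and emit a pty.output_throttled warning.
  allow(frameBytes: number, nowMs: number): boolean {
    if (nowMs - this.windowStartMs >= 1000) {
      // A new one-second window begins; reset the byte counter.
      this.windowStartMs = nowMs;
      this.bytesInWindow = 0;
    }
    if (this.bytesInWindow + frameBytes > this.limit) return false;
    this.bytesInWindow += frameBytes;
    return true;
  }
}
```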

Checkpoint lifecycle

Checkpoint creation

Operator-initiated:

Operator clicks ⏸ or sends POST /api/v1/sessions/:id/checkpoints.
Session manager transitions session to paused.
Suspends in-flight subagents (SIGTSTP to caged processes; they can be resumed).
Captures the checkpoint snapshot (see below).
Returns the checkpoint ID to the operator.

Model-initiated:

The primary (or a subagent) calls the checkpoint tool: { "tool": "checkpoint", "reason": "Need clarification on deployment target" }.
Session manager transitions session to paused.
Same suspension and snapshot as operator-initiated.
The checkpoint includes the model's stated reason, surfaced in the UI.

Compaction-initiated (reason: compaction_pending), per ADR-0024:

The harness compaction pipeline runs the configured strategy. When strategy: checkpoint, the harness creates a checkpoint with reason: "compaction_pending" and ends the current run with finishReason: "awaiting_compaction".
Session manager transitions session to paused.
The checkpoint snapshot is captured normally plus a sibling CompactionRecord (see § CompactionRecord below) describing the proposed compaction.
The Compactor UI (per ui/compactor.md) inspects the proposed compaction; the operator edits/approves/rejects.
On approve, the operator-approved compaction is applied and the session resumes via the normal checkpoint resume path.
On reject, the checkpoint is rolled back and the session continues without compaction (the operator must then either manually compact with a different strategy or manage the context-overflow themselves).

The reason: compaction_pending is a distinguished value; the UI surfaces it differently from operator/model-initiated checkpoints. The API endpoint for inspecting a compaction-pending checkpoint composes the checkpoint + the proposed CompactionRecord in one response (per http-api.md § Compaction).

Checkpoint snapshot

A checkpoint captures:

interface Checkpoint {
  id: string;                     // ULID
  session_id: string;
  run_id: string;                 // the run that was paused
  created_at: number;             // epoch ms
  created_by: "operator" | "model" | "compaction";   // ADR-0024 adds "compaction"
  reason: string | null;          // model's stated reason, operator's label, or "compaction_pending"
  compaction_id: string | null;   // when created_by == "compaction", references CompactionRecord

  // State snapshot
  message_history_cursor: string; // pointer to the last message at pause time
  active_subagents: SubagentSnapshot[];
  pending_tool_calls: ToolCall[]; // tool calls that were in-flight
  prompts_in_effect: PromptSnapshot[];

  // Metadata
  resumed_at: number | null;
  rolled_back: boolean;
  superseded_by: string | null;   // another checkpoint that replaced this one
}

interface SubagentSnapshot {
  name: string;
  state: "suspended" | "exited";  // suspended = SIGTSTP'd; exited = finished before pause
  cage_id: string | null;
  last_output: string;            // last 500 chars of output, for UI preview
  pty_id: string | null;
}

interface PromptSnapshot {
  target: "primary" | string;     // "primary" or subagent name
  content_hash: string;           // hash of the prompt content at pause time
  editable: boolean;              // always true in v0
}

Checkpoint inspection

At a checkpoint, the operator can:

View the message history up to the pause point.
View active subagent state — which subagents were running, their last output, their cage status.
View pending tool calls — what the model was about to do.
Edit system prompts — change the primary's or any subagent's system prompt. Edits are captured in the audit log.
View the model's reason (if model-initiated) — "I need clarification on..."

Resume

POST /api/v1/sessions/:id/checkpoints/:cid/resume

Validate the checkpoint is the current pause point (not superseded).
Apply any prompt edits the operator made.
Resume suspended subagents (SIGCONT to caged processes).
Transition session to running.
The primary continues from where it paused, with the (possibly edited) prompts.

If the operator edited prompts, the primary receives a system-level message: "System prompts were updated at this checkpoint. Previous prompt hash: X, new prompt hash: Y." This lets the model adjust its approach if the operator changed direction.

Rollback

POST /api/v1/sessions/:id/checkpoints/:cid/rollback

Validate the checkpoint exists in this session.
Kill all subagents spawned after the checkpoint (they're working with post-checkpoint context).
Mark all messages after the checkpoint as superseded: true (they remain in history for auditability, but the UI greys them out).
Transition session to idle (the operator can now post a new message that starts a fresh run from the checkpoint's state).
The next run uses the checkpoint's prompt snapshots as its starting prompts.

Rollback does not delete data. The full history is preserved. Rollback is a pointer operation: it changes where the "current state" points to.

Checkpoint limits

Limit	Default	Rationale
Max checkpoints per session	50	Prevent unbounded storage growth. Oldest auto-archived (queryable, not shown in active list).
Checkpoint snapshot size	~100 KB typical	Message cursor is a pointer, not a copy. Subagent snapshots are summaries.

Compaction lifecycle

Per ADR-0024, context compaction is owned by the harness (per agent.md § Compaction) and operates on the message-reconstruction path. The session manager's role is to persist CompactionRecord rows, transition the session state during strategy: checkpoint events, and surface compaction events to the operator.

CompactionRecord

A CompactionRecord is written for every compaction event (whether triggered automatically by threshold-crossing, by reactive-fallback after a context-length error, or by manual operator trigger).

interface CompactionRecord {
  id: string;                              // ULID
  session_id: string;
  run_id: string;                          // the run whose pre-call check triggered, or the run created by operator_manual
  agent_path: string;                      // which agent's window (per ADR-0022 canonical path)
  created_at: number;                      // epoch ms

  // Trigger and strategy
  trigger: "threshold_crossed" | "provider_overflow_retry" | "operator_manual" | "scheduled";
  strategy: "drop" | "summarize" | "delegate" | "checkpoint";

  // Estimates
  threshold_estimate: number;              // fraction of context window at trigger time (e.g. 0.91)
  after_estimate: number;                  // fraction after compaction (e.g. 0.55)
  window_upper: number;                    // configured upper threshold (e.g. 0.85)
  window_lower: number;                    // configured lower threshold (e.g. 0.60)

  // Message effect
  superseded_message_ids: string[];        // marked superseded by this event
  summary_message_id: string | null;       // new MessageRecord, if summarize/delegate produced one
  summary: string | null;                  // human-readable summary (audit log + UI)

  // Plugin participation
  plugins_fired: Array<{
    name: string;
    role: "observer" | "compactor";
    duration_ms: number;
    result_kind: "inject" | "retain" | "compactor_result" | "null" | "error";
  }>;

  // Cost
  plugin_cost: {
    provider: string;
    model: string;
    input_tokens: number;
    output_tokens: number;
    cost_usd: number;
  } | null;                                 // null when no model invoked (drop, observer-only delegate result)

  // Failure tracking
  fallback_occurred: boolean;              // true if delegate/summarize fell back to drop
  fallback_reason: string | null;          // human-readable reason

  // Operator feedback (per ADR-0024)
  operator_flag: "good" | "bad" | "neutral" | null;
  operator_notes: string | null;

  // Linkage
  checkpoint_id: string | null;            // when strategy == "checkpoint", links the paired Checkpoint
}

Storage

CompactionRecord rows live in the compactions table in kaged's SQLite (per ADR-0005) alongside checkpoints, messages, and sessions. Schema migration is in @kaged/storage; the migration adds the compactions table with indexes on session_id, run_id, and agent_path.

Lifecycle integration

Most strategies (drop / summarize / delegate) — the harness writes the CompactionRecord after the strategy completes and before the next LLM call. The session state does NOT change; the run continues with the compacted message list.
strategy: checkpoint — the harness writes the CompactionRecord with a placeholder summary (the proposed change), creates a paired Checkpoint with reason: "compaction_pending" and compaction_id: <CompactionRecord.id>, and ends the current run. The session transitions to paused per the checkpoint lifecycle. On checkpoint resume:
- Approve → the operator-approved compaction is applied (the CompactionRecord is finalized with the actual superseded_message_ids and summary), the checkpoint marks resumed_at, the session transitions to running, and a new run is started with the compacted list.
- Reject → the CompactionRecord is marked fallback_occurred: true, fallback_reason: "operator_rejected", the checkpoint is rolled back (existing rollback semantics), and the session transitions to idle. No messages are superseded. The operator must now either manually compact with a different strategy (per http-api.md) or manage the context-overflow themselves.

Operator feedback

The operator can attach a flag (good / bad / neutral) and free-text notes to any CompactionRecord after the fact via PATCH /api/v1/sessions/:id/compactions/:cid (per http-api.md). This persists on the CompactionRecord and surfaces in the Compactor UI's history view. Plugin authors iterating on a compactor plugin use these flags as a feedback signal.

Interaction with rollback

If the operator rolls back past a compaction point (rollback target is a checkpoint that predates the compaction):

The standard rollback path runs (per § Rollback).
Messages marked superseded = true by the compaction (its superseded_message_ids) are unsuperseded — they return to active state in the message list.
Any summary_message_id created by the compaction is superseded by the rollback (it was a post-compaction artifact, not present at the rollback target).
The CompactionRecord itself is NOT deleted — it remains in the audit log marked rolled_back: true (a new field on CompactionRecord).

This is the same machinery as message rollback (per ADR-0022 and § Rollback); compaction needs no new flag on messages, only the rolled_back field on the CompactionRecord itself.
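
The compaction-specific part of that rollback (unsupersede the compacted messages, supersede the summary, mark the record) can be sketched as a pure function. Assumptions are stated loudly here: the standard rollback path is taken to have run already, every id in superseded_message_ids is taken to sit at or before the rollback target, and Msg, CompactionEffect, and rollBackCompaction are illustrative names covering only the fields the sketch touches.

```typescript
interface Msg {
  id: string;
  superseded: boolean;
}

// Only the CompactionRecord fields this sketch needs.
interface CompactionEffect {
  superseded_message_ids: string[];
  summary_message_id: string | null;
  rolled_back: boolean; // the field the text above describes as new
}

function rollBackCompaction(
  messages: readonly Msg[],
  compaction: CompactionEffect,
): { messages: Msg[]; compaction: CompactionEffect } {
  const restored = messages.map((m) => {
    if (compaction.superseded_message_ids.indexOf(m.id) !== -1) {
      return { ...m, superseded: false }; // compacted away earlier; active again
    }
    if (m.id === compaction.summary_message_id) {
      return { ...m, superseded: true }; // post-compaction artifact
    }
    return { ...m };
  });
  // The record is kept for the audit log and only flagged, never deleted.
  return { messages: restored, compaction: { ...compaction, rolled_back: true } };
}
```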

Compaction limits

Limit	Default	Rationale
Max `CompactionRecord` rows per session	1000	Long sessions can accrue many compactions; old rows are summarized before being purged. When the limit is hit, the oldest 100 are aggregated into a single "compactions before T" summary row.
Compaction event timeout	60s	A strategy that takes longer than 60s is treated as failed and falls back to `drop`.

Reattach semantics

Sessions survive operator disconnects. The reattach flow is designed for the common case: "operator's phone dropped Wi-Fi mid-task."

Disconnect detection

WebSocket close frame received — clean disconnect. Immediate.
WebSocket ping/pong timeout — unclean disconnect (network loss). Detected within 30s (ping interval is 15s, 2 missed pongs = disconnect).
HTTP idle timeout — operator hasn't made any request in 10 minutes. Not a disconnect per se; the session remains attached but the buffer deadline starts.

During disconnect

The daemon:

Keeps the session and all its in-flight work alive. Work does not stop because the operator disconnected.
Buffers WebSocket frames that would have been sent to the operator:
- Output channel: buffered up to 10 minutes or 50 MiB (whichever comes first).
- Events channel: buffered with the same limits.
- PTY channel: NOT buffered for replay (PTY scrollback is in the PTY ring buffer; see PTY broker).
- Control channel: not buffered (control messages are request/response, not streaming).
Records the disconnect in the audit log: session.detached { session_id, reason, user_id }.

Reattach

Operator opens a new WebSocket to /api/v1/sessions/:id/socket.
Sends control { type: "hello", payload: { resume_from_seq: { output: N, events: M } } }.
Daemon checks the buffer:
- Buffer has the requested sequences: replays missed frames from N+1 onward for output, M+1 for events. Sends welcome { session_id, server_seq: { ... } } first.
- Buffer gap (too old or overflowed): sends closing { code: "resume_failed" }. Client must do a full re-fetch via GET /api/v1/sessions/:id/messages?since=<last-seen-message-id>.
PTY reattach: the broker sends the current scrollback content for any active PTYs as a burst of pty.stdout frames. This is the PTY ring buffer content, not a replay of every byte since disconnect.
Session is now attached again. Normal streaming resumes.
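
The buffer check in the reattach flow can be sketched per channel and then combined. This is a hedged illustration: Frame, ChannelState, planChannel, and planReattach are hypothetical names, each buffer is assumed to hold a contiguous run of frames in ascending seq order, and treating a client that claims to be ahead of the server as a failed resume is this sketch's choice, since the text above does not cover that case.

```typescript
interface Frame {
  seq: number;
  data: string;
}

interface ChannelState {
  buffer: readonly Frame[]; // frames still held, ascending seq, contiguous
  serverSeq: number;        // latest seq produced on this channel
}

type ChannelPlan = { kind: "replay"; frames: Frame[] } | { kind: "gap" };

function planChannel(ch: ChannelState, lastSeen: number): ChannelPlan {
  if (lastSeen > ch.serverSeq) return { kind: "gap" }; // assumption: reject
  if (lastSeen === ch.serverSeq) return { kind: "replay", frames: [] };
  // The client needs lastSeen + 1 onward; the oldest buffered frame must cover it.
  if (ch.buffer.length === 0 || ch.buffer[0].seq > lastSeen + 1) return { kind: "gap" };
  return { kind: "replay", frames: ch.buffer.filter((f) => f.seq > lastSeen) };
}

type ReattachPlan =
  | { kind: "welcome"; output: Frame[]; events: Frame[] }
  | { kind: "closing"; code: "resume_failed" };

function planReattach(
  output: ChannelState,
  events: ChannelState,
  resumeFromSeq: { output: number; events: number },
): ReattachPlan {
  const o = planChannel(output, resumeFromSeq.output);
  const e = planChannel(events, resumeFromSeq.events);
  if (o.kind === "gap" || e.kind === "gap") {
    // Client must fall back to a full re-fetch of messages.
    return { kind: "closing", code: "resume_failed" };
  }
  return { kind: "welcome", output: o.frames, events: e.frames };
}
```

The PTY burst is separate: it comes from the scrollback ring, not from these sequence-numbered buffers.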

Multi-device

v0 allows one operator connection per session at a time (per http-api.md). A second WebSocket upgrade attempt while attached returns 409 conflict.

To move between devices:

Close the WebSocket on the old device (or wait for idle timeout).
Open on the new device.

v1 may add read-only observer connections (second device can watch but not interact). Not v0.

Persistence

What is persisted and when

Data	Persisted when	Storage
Session record (id, project, state, created_by, timestamps)	On creation, on every state transition	`sessions` table
Messages (operator and model)	Immediately on receipt/generation	`messages` table
Run records	On creation, on completion/failure/cancel	`runs` table
Subagent invocations	On spawn, on exit	`subagent_invocations` table
Tool calls and results	On call, on result	`tool_calls` table
Checkpoints	On creation	`checkpoints` table
PTY transcripts	On PTY close	`run_pty_transcripts` table
Prompt edits (at checkpoints)	On edit	`prompt_edits` table

Persistence guarantees

No data loss on clean shutdown. The daemon flushes all pending writes before exiting.
Minimal data loss on SIGKILL. SQLite WAL mode ensures committed transactions survive. The only loss is the current in-flight LLM response (which is streamed and not yet committed). On next startup, the session manager:
1. Scans for sessions in running state.
2. Checks if the corresponding run has a committed response.
3. If not, marks the run as failed with error: "daemon_crash_during_run".
4. Transitions the session to idle.
5. Logs session.crash_recovered in the audit log.
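
The startup scan above can be sketched over in-memory rows. This is an illustration under stated assumptions: the row shapes are simplified, `hasCommittedResponse` is a stand-in for however the daemon detects a committed response, and `error` is shown as a plain string although the `runs.error` column holds a JSON `RunError` whose shape is not given in this section. The spec does not say what happens to a run that does have a committed response, so the sketch leaves such a run untouched.

```typescript
type RunState = "pending" | "running" | "done" | "failed" | "cancelled";

interface RunRow {
  id: string;
  sessionId: string;
  state: RunState;
  error: string | null;
  hasCommittedResponse: boolean; // hypothetical flag
}

interface SessionRow {
  id: string;
  state: string;
}

interface AuditEvent {
  event: string;
  session_id: string;
  failed_run_id: string | null;
}

function recoverAfterCrash(sessions: SessionRow[], runs: RunRow[]): AuditEvent[] {
  const audit: AuditEvent[] = [];
  for (const session of sessions) {
    if (session.state !== "running") continue; // step 1: only running sessions
    let failedRunId: string | null = null;
    for (const run of runs) {
      if (run.sessionId !== session.id || run.state !== "running") continue;
      if (!run.hasCommittedResponse) {          // step 2
        run.state = "failed";                   // step 3
        run.error = "daemon_crash_during_run";
        failedRunId = run.id;
      }
    }
    session.state = "idle";                     // step 4
    audit.push({                                // step 5
      event: "session.crash_recovered",
      session_id: session.id,
      failed_run_id: failedRunId,
    });
  }
  return audit;
}
```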

Schema sketch

CREATE TABLE sessions (
  id            TEXT PRIMARY KEY,   -- ULID
  project_id    TEXT NOT NULL,
  created_by    TEXT NOT NULL,      -- user_id
  state         TEXT NOT NULL,      -- idle, running, queued, paused, ended, failed
  created_at    INTEGER NOT NULL,   -- epoch ms
  updated_at    INTEGER NOT NULL,
  ended_at      INTEGER,
  forked_from   TEXT,               -- session ID if this was forked
  model         TEXT,               -- "provider:model" — overrides DSL primary.model for this session
  bound_issue   TEXT,               -- ADR-0034: FK to issues.id — session's bound issue (nullable)
  FOREIGN KEY (project_id) REFERENCES projects(id)
);

CREATE TABLE messages (
  id            TEXT PRIMARY KEY,   -- ULID
  session_id    TEXT NOT NULL,
  run_id        TEXT,               -- null for the initial operator message before a run starts
  role          TEXT NOT NULL,      -- operator, primary, subagent, system
  content       TEXT NOT NULL,
  created_at    INTEGER NOT NULL,
  superseded    INTEGER DEFAULT 0,  -- 1 if rolled back past
  metadata      TEXT,               -- JSON: token counts, model used, etc.
  FOREIGN KEY (session_id) REFERENCES sessions(id)
);

CREATE TABLE runs (
  id            TEXT PRIMARY KEY,   -- ULID
  session_id    TEXT NOT NULL,
  state         TEXT NOT NULL,      -- pending, running, done, failed, cancelled
  created_at    INTEGER NOT NULL,
  completed_at  INTEGER,
  duration_ms   INTEGER,
  tokens_in     INTEGER,
  tokens_out    INTEGER,
  error         TEXT,               -- JSON: RunError, null if no error
  FOREIGN KEY (session_id) REFERENCES sessions(id)
);

CREATE TABLE subagent_invocations (
  id            TEXT PRIMARY KEY,   -- ULID
  run_id        TEXT NOT NULL,
  session_id    TEXT NOT NULL,
  subagent_name TEXT NOT NULL,
  cage_id       TEXT,               -- null if uncaged
  state         TEXT NOT NULL,      -- spawned, running, done, failed, killed
  spawned_at    INTEGER NOT NULL,
  exited_at     INTEGER,
  exit_code     INTEGER,
  FOREIGN KEY (run_id) REFERENCES runs(id)
);

CREATE TABLE tool_calls (
  id            TEXT PRIMARY KEY,   -- ULID
  run_id        TEXT NOT NULL,
  session_id    TEXT NOT NULL,
  caller        TEXT NOT NULL,      -- "primary" or subagent name
  tool_name     TEXT NOT NULL,
  input         TEXT NOT NULL,      -- JSON
  output        TEXT,               -- JSON, null if pending
  state         TEXT NOT NULL,      -- pending, done, failed
  created_at    INTEGER NOT NULL,
  completed_at  INTEGER,
  FOREIGN KEY (run_id) REFERENCES runs(id)
);

CREATE TABLE checkpoints (
  id            TEXT PRIMARY KEY,   -- ULID
  session_id    TEXT NOT NULL,
  run_id        TEXT NOT NULL,
  created_at    INTEGER NOT NULL,
  created_by    TEXT NOT NULL,      -- "operator" or "model"
  reason        TEXT,
  snapshot      TEXT NOT NULL,      -- JSON: CheckpointSnapshot
  resumed_at    INTEGER,
  rolled_back   INTEGER DEFAULT 0,
  superseded_by TEXT,
  FOREIGN KEY (session_id) REFERENCES sessions(id)
);

CREATE TABLE run_pty_transcripts (
  id            TEXT PRIMARY KEY,
  run_id        TEXT NOT NULL,
  pty_id        TEXT NOT NULL,
  transcript    BLOB NOT NULL,      -- raw bytes with timing markers
  line_count    INTEGER NOT NULL,
  created_at    INTEGER NOT NULL,
  FOREIGN KEY (run_id) REFERENCES runs(id)
);

CREATE TABLE prompt_edits (
  id            TEXT PRIMARY KEY,   -- ULID
  checkpoint_id TEXT NOT NULL,
  session_id    TEXT NOT NULL,
  target        TEXT NOT NULL,      -- "primary" or subagent name
  old_hash      TEXT NOT NULL,
  new_hash      TEXT NOT NULL,
  new_content   TEXT NOT NULL,      -- the edited prompt text
  edited_at     INTEGER NOT NULL,
  FOREIGN KEY (checkpoint_id) REFERENCES checkpoints(id)
);

-- Indexes
CREATE INDEX idx_messages_session ON messages(session_id, created_at);
CREATE INDEX idx_runs_session ON runs(session_id, created_at);
CREATE INDEX idx_subagent_invocations_run ON subagent_invocations(run_id);
CREATE INDEX idx_tool_calls_run ON tool_calls(run_id);
CREATE INDEX idx_checkpoints_session ON checkpoints(session_id, created_at);

All tables use TEXT primary keys (ULIDs). Timestamps are integer epoch milliseconds (per ADR-0005 implementation notes). JSON columns use TEXT with application-level validation.

Session forking

An operator can create a new session forked from an existing one:

POST /api/v1/projects/music/sessions
{
  "resume_from": "01HXAB..."
}

Forking:

Creates a new session with forked_from = <source_session_id>.
Copies the message history from the source session up to the current point (or up to a specified checkpoint).
Does NOT share state — the fork is a snapshot copy. Changes to the original don't affect the fork.
The new session starts in idle state.

Forking is relatively expensive because the message history is copied, so its cost grows with the number of messages in the source session. It is not a branching model like git: there is no merge. Forks are independent sessions that happen to share an origin.
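
The snapshot-copy behavior can be sketched as follows. The `Message` shape is simplified, `forkMessages` and its parameters are hypothetical, and using a timestamp cutoff to stand in for "up to a specified checkpoint" is an assumption. Whether a real fork skips superseded rows is not specified here, so the sketch copies every row before the cutoff.

```typescript
interface Message {
  id: string;
  sessionId: string;
  role: string;
  content: string;
  createdAt: number; // epoch ms
}

// Copies messages into a new session. Copies are fresh objects with fresh IDs,
// so later changes to the source do not reach the fork.
function forkMessages(
  source: Message[],
  newSessionId: string,
  newId: () => string, // hypothetical ID generator (ULIDs in the real schema)
  cutoffMs?: number,   // optional: copy only messages created at or before this time
): Message[] {
  const copies: Message[] = [];
  for (const m of source) {
    if (cutoffMs !== undefined && m.createdAt > cutoffMs) continue;
    copies.push({ ...m, id: newId(), sessionId: newSessionId });
  }
  return copies;
}
```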

Session-issue binding (ADR-0034)

Per ADR-0034, a session binds to at most one issue. The agent's working checklist is stored as issue-owned todos, mutated through the root-only kaged.todo tool that operates implicitly on the bound issue.

Binding semantics

Session-side pointer. bound_issue is a nullable column on the sessions table. It points at an issues.id row. An issue may be worked by many sessions over its lifetime, one active at a time.
Binding is not a state transition. Binding and unbinding from the UI/sidebar attaches or detaches the pointer; it does not mutate, clear, or complete stored todos. Unbinding simply puts the existing (possibly incomplete) todos out of reach of the current session. Re-binding the same issue rehydrates them.
kaged.todo requires a binding. With no binding, the tool resolves to an error that tells the agent to ask the operator to bind an issue. This is deliberate: the issue is the storage, so no issue means no todos.
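
The binding semantics above can be illustrated with a small sketch in which todos live with the issue and the session only holds a pointer. All names here (`IssueTodoStore`, `bind`, `unbind`, `resolveTodos`, the error text) are hypothetical and do not describe the actual `kaged.todo` tool contract.

```typescript
interface Todo {
  text: string;
  done: boolean;
}

interface Session {
  id: string;
  boundIssue: string | null;
}

type TodoResult = { ok: true; todos: Todo[] } | { ok: false; error: string };

// The issue is the storage: todos are keyed by issue ID, never by session.
class IssueTodoStore {
  private readonly byIssue = new Map<string, Todo[]>();

  todosFor(issueId: string): Todo[] {
    let list = this.byIssue.get(issueId);
    if (list === undefined) {
      list = [];
      this.byIssue.set(issueId, list);
    }
    return list;
  }
}

// Binding and unbinding only move the pointer; stored todos are not touched.
function bind(session: Session, issueId: string): void {
  session.boundIssue = issueId;
}

function unbind(session: Session): void {
  session.boundIssue = null;
}

function resolveTodos(session: Session, store: IssueTodoStore): TodoResult {
  if (session.boundIssue === null) {
    return { ok: false, error: "No bound issue. Ask the operator to bind an issue to this session." };
  }
  return { ok: true, todos: store.todosFor(session.boundIssue) };
}
```

Unbinding and then re-binding the same issue returns the same todo list, which is the "rehydrate" behavior described above.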

Bind/unbind endpoints

Endpoint	Method	Description
`/api/v1/sessions/:id/bind`	`PUT`	Bind an issue to the session. Body: `{ "issue_id": "<ulid>" }`. Validates the issue exists.
`/api/v1/sessions/:id/bind`	`DELETE`	Unbind the issue from the session. Sets `bound_issue` to `null`.

Session creation with binding

POST /api/v1/projects/:slug/sessions accepts an optional bound_issue field in the body. When present, the session is created with the binding already set. This is the "send to agent" path from the issue detail screen.

Auto-conclude answer

Per ADR-0034, ending a session never silently resolves its bound issue. The operator confirms closure at the closing checkpoint. When the agent has marked the last open criterion done, it calls kaged.checkpoint with a reason like "acceptance criteria met, requesting sign-off." The session pauses. The operator reviews, then resumes; the resume path transitions the issue to resolved (via kaged.issue transition or the UI) and the session may conclude.

This resolves the standing open question: manual session termination does not auto-resolve a bound issue. The operator confirms at the closing checkpoint.

Concurrency

Session-level concurrency

Resource	Limit	Enforced when	Enforced by
Max concurrent running sessions per project	4	Message send (`post_message`)	Session manager
Max concurrent running sessions per operator (across projects)	16	Message send (`post_message`)	Session manager
Max concurrent runs per session	1	Message send (`post_message`)	Session manager (v0; sequential runs only)
Max concurrent subagents per run	8	Subagent dispatch	Subagent supervisor (per `daemon.md`)
Max concurrent PTYs per session	8	PTY allocation	PTY broker

Concurrency is checked at message-send time, not session-creation time. Session creation always succeeds (as long as the project is ready). The running-session count is checked when the operator posts a message:

Under the limit: the run starts immediately. Session transitions idle → running.
At the limit: the message is persisted, a run is created in pending, and the session transitions idle → queued. The operator sees a pause indicator and can resume when a slot frees.

No auto-dequeue. When a running session ends and frees a slot, the daemon broadcasts the updated count via the system WebSocket (sessions.running_count) but does not automatically start queued sessions. The operator explicitly resumes each queued session via POST /api/v1/sessions/:id/resume. This is deliberate: the operator decides which queued session gets the freed slot, not the daemon.

What counts as "running": only sessions in the running state. Sessions in idle, queued, paused, ended, or failed do not count against the concurrency limit. A paused session has suspended its run (checkpoint) and released the slot; a queued session never started its run.

Exceeding a per-session limit (one run per session — posting while running) returns 409 conflict. Exceeding a cross-session concurrency limit (per-project or per-operator) does not return an error — the message is queued.
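
The message-send decision can be sketched as a pure function. The limits (4 per project, 16 per operator), the 409 on posting while running, and the `per_project` / `per_operator` reason values come from this spec. The function name, the result shape, and the order in which the two cross-session limits are checked are assumptions. Only the `idle` and `running` starting states are modeled.

```typescript
const PER_PROJECT_LIMIT = 4;
const PER_OPERATOR_LIMIT = 16;

type Admission =
  | { kind: "start" }                                          // idle -> running
  | { kind: "queue"; reason: "per_project" | "per_operator" }  // idle -> queued
  | { kind: "conflict"; status: 409 };                         // posting while running

function admitMessage(
  sessionState: "idle" | "running",
  runningInProject: number,   // sessions in `running` for this project
  runningForOperator: number, // sessions in `running` for this operator, all projects
): Admission {
  // Per-session limit: one run at a time, rejected rather than queued.
  if (sessionState === "running") return { kind: "conflict", status: 409 };
  // Cross-session limits: never an error, the message is queued.
  if (runningInProject >= PER_PROJECT_LIMIT) return { kind: "queue", reason: "per_project" };
  if (runningForOperator >= PER_OPERATOR_LIMIT) return { kind: "queue", reason: "per_operator" };
  return { kind: "start" };
}
```

Because `paused` and `queued` sessions are not counted, the two counters passed in should include only sessions whose state is exactly `running`.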

Locking

Session lock: each session has an in-memory mutex. All state transitions acquire the lock. This prevents races between "operator cancels run" and "run completes naturally" arriving simultaneously.
No cross-session locks. Sessions are independent. A deadlock between sessions is architecturally impossible.
Database writes are serialized by SQLite's write lock (or Postgres' transaction isolation). The session manager does not add its own database-level locking beyond the engine's guarantees.
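
The per-session lock can be sketched as a promise chain that serializes transitions. This is one possible shape, assuming an async single-process daemon; `SessionLock` and `transition` are hypothetical names. The example reproduces the "cancel versus natural completion" race: both callers arrive together, and the second sees that the run is no longer `running`.

```typescript
// Serializes async critical sections by chaining them on a tail promise.
class SessionLock {
  private tail: Promise<void> = Promise.resolve();

  run<T>(fn: () => T | Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    // Keep the chain alive whether fn resolved or rejected.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

interface Run {
  state: "running" | "done" | "cancelled";
}

// Returns true if this caller performed the transition, false if it lost the race.
function transition(lock: SessionLock, run: Run, to: "done" | "cancelled"): Promise<boolean> {
  return lock.run<boolean>(async (): Promise<boolean> => {
    if (run.state !== "running") return false; // guard evaluated under the lock
    await Promise.resolve();                   // stands in for an async database write
    run.state = to;
    return true;
  });
}
```

Without the lock, both callers could pass the guard before either write lands. With it, the guard and the write form one critical section.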

DSL hot-reload interaction

Per daemon.md:

DSL changes are applied at next session-start, not to active sessions.
Active sessions continue with the DSL they were started under.
The UI shows a "DSL changed; new sessions will use the updated config" indicator.
The operator can use "restart with new DSL" in the session UI, which ends the current session and creates a new one with the updated DSL.

The session manager stores the DSL version (a hash of the project.yaml content) at session creation time. This is used to:

Detect stale sessions (DSL changed since session started).
Replay/audit: know which DSL was in effect during a session.

Audit events

Event	When	Data
`session.created`	New session	session_id, project_id, user_id, forked_from
`session.queued`	Message accepted but throttled by concurrency limit	session_id, run_id, message_id, reason (`per_project` or `per_operator`), running_count, limit
`session.resumed_from_queue`	Operator resumed a queued session	session_id, run_id, user_id, running_count
`session.queued_discarded`	Operator discarded a queued message	session_id, run_id, message_id, user_id
`session.state`	State transition	session_id, from_state, to_state, trigger
`session.attached`	Operator connected via WebSocket	session_id, user_id, device_hint
`session.detached`	Operator disconnected	session_id, user_id, reason (clean/timeout/error)
`session.ended`	Session ended	session_id, user_id, run_count, duration
`session.crash_recovered`	Daemon restarted; session was in `running`	session_id, failed_run_id
`run.created`	New run started	run_id, session_id, message_preview
`run.completed`	Run finished	run_id, state, duration_ms, tokens
`run.cancelled`	Operator cancelled a run	run_id, session_id, user_id
`checkpoint.created`	Checkpoint taken	checkpoint_id, session_id, created_by, reason
`checkpoint.resumed`	Checkpoint resumed	checkpoint_id, session_id, prompts_edited
`checkpoint.rolled_back`	Rollback to checkpoint	checkpoint_id, session_id, messages_superseded
`prompt.edited`	Prompt changed at checkpoint	checkpoint_id, target, old_hash, new_hash
`pty.allocated`	PTY created for a subagent	pty_id, session_id, subagent_name
`pty.closed`	PTY closed	pty_id, reason, transcript_lines
`pty.reattached`	PTY reconnected after disconnect	pty_id, scrollback_lines_sent

Failure modes

Failure	Detection	Recovery	Operator impact
LLM provider unreachable during run	HTTP timeout / connection refused	Run marked `failed`. Session → `idle`. Operator retries.	Message "Provider unreachable. Try again or check provider config."
Subagent cage fails to spawn	Sandbox compiler error or bwrap exec failure	Run marked `failed` with cage error details. Session → `idle`.	Error details shown; operator checks DSL cage config.
Subagent exceeds walltime	Cage walltime watchdog	Subagent killed. Run continues with partial output if other subagents are active; otherwise run marked `failed`.	Warning: "Subagent X exceeded walltime limit."
Subagent OOM	cgroup OOM killer	Same as walltime.	Warning: "Subagent X exceeded memory limit."
Database write failure	SQLite/Postgres error	Session transitions to `failed`. Daemon logs critical error.	Session lost. Operator creates a new session.
Daemon SIGKILL during run	Next startup scan	Run marked `failed`. Session → `idle`.	"Previous run was interrupted by a daemon restart."
WebSocket disconnect mid-run	Ping/pong timeout	Work continues. Frames buffered. Operator reattaches.	Seamless if reattach within buffer window; full re-fetch otherwise.
Checkpoint resume with edited prompts fails	Primary rejects the prompt	Run starts but immediately fails. Session → `idle`.	"Prompt edit caused an error. Review the edited prompt."
Fork source session not found	Database lookup miss	Fork request returns 404.	"Source session not found."

Testing notes

State machine tests

Every transition: assert the guard condition, the side effects, and the resulting state.
Invalid transitions: assert rejection (e.g., ended → running returns error).
Concurrent transitions: two concurrent callers try to transition the same session simultaneously. Assert one wins, one gets a conflict.
Crash recovery: simulate daemon kill during running. Assert next startup marks run as failed and session as idle.

Concurrency throttle tests

Under limit: post message with running count below limit. Assert session → running, run starts.
At per-project limit: create 4 running sessions in a project, post a message on a 5th. Assert session → queued, run is pending, message is persisted.
At per-operator limit: create 16 running sessions across projects, post a message on a 17th. Assert session → queued.
Resume from queued: session is queued, slot frees, operator calls resume. Assert session → running, pending run starts.
Resume when still at limit: session is queued, no slot freed. Operator calls resume. Assert 409 conflict (slot not available).
Discard queued message: session is queued, operator discards. Assert session → idle, run marked cancelled, message marked superseded.
Paused session frees slot: session A is running, session B is queued. Checkpoint session A. Assert session A → paused, running count decrements. Operator can now resume session B.
Session creation ignores concurrency: create 5 sessions in a project (limit is 4). Assert all 5 succeed with idle state. Concurrency is not checked at creation.
System count broadcast: start/end a run. Assert sessions.running_count event fires on the system socket with correct counts.

PTY tests

Allocation: request a PTY for a subagent. Assert the PTY is created and I/O flows.
Resize: send resize event. Assert SIGWINCH reaches the process.
Detach/reattach: disconnect the WebSocket. Assert PTY stays alive. Reconnect. Assert scrollback is delivered.
Transcript persistence: close a PTY. Assert the transcript is in the database. Fetch via API. Assert content matches.
Rate limiting: flood the PTY output channel. Assert frames are dropped with a warning and the daemon does not run out of memory.

Checkpoint tests

Create (operator): request checkpoint during a run. Assert session pauses, subagents suspended.
Create (model): mock a primary that calls the checkpoint tool. Assert same behavior.
Resume: resume a checkpoint. Assert subagents get SIGCONT. Assert session → running.
Resume with prompt edit: edit a prompt at a checkpoint, resume. Assert the primary sees the edit notification.
Rollback: rollback to a checkpoint. Assert later messages are marked superseded. Assert next run starts from the checkpoint state.
Rollback to old checkpoint: rollback to a checkpoint that is not the most recent. Assert intermediate checkpoints are superseded.

Persistence tests

Message ordering: post 100 messages. Assert they come back in order.
Run recording: complete a run. Assert all fields are persisted.
Session survives restart: create a session, restart the daemon. Assert session is still there with correct state.
Fork: fork a session. Assert message history is copied. Modify original. Assert fork is unaffected.

Reattach tests

Clean reattach within buffer window: disconnect, reconnect within 10 minutes. Assert missed frames are replayed.
Reattach after buffer overflow: disconnect, wait for buffer to overflow. Assert resume_failed and client falls back to HTTP re-fetch.
Multi-device conflict: connect from device A, try to connect from device B. Assert 409.

Open questions

Parallel runs. v0 is one run per session. Should v0.x allow queuing multiple messages (the primary processes them sequentially) or true parallel runs (multiple primaries)? Sequential queuing seems low-cost; true parallel is a big complexity bump.
Session sharing between operators. v0 is single-operator sessions. v1 may want shared sessions (one operator watches another's session in read-only). Needs auth model extension.
Session archival. Old sessions accumulate. When do they get archived? v0: never automatically. Operator uses DELETE to end them. A future kaged session archive --older-than 30d is plausible.
Subagent suspension fidelity. SIGTSTP to a caged process works for simple commands but may not work for all programs (some ignore TSTP, some corrupt state). v0 documents the limitation; v0.x may add a "snapshot and kill" checkpoint mode for subagents that can't be suspended.
PTY replay fidelity. Timed replay of PTY transcripts is useful for debugging but the timing markers add storage overhead. v0 stores timestamps per chunk (every ~100ms batch); finer granularity deferred.
Session memory limit. A session with thousands of messages may exhaust the context window of the primary model. The session manager does not manage context windows (that's the primary's job), but should it enforce a message-count limit? v0: no limit. The primary is responsible for its own context management.

Amendments

2026-05-23 — Per-session model override

Added model column to the sessions table and the corresponding model field to SessionRecord. When set, it is a "provider:model" string that overrides the DSL's primary.model alias for all runs in this session. The daemon's dispatch path checks session.model before alias resolution — if set, it splits the override into provider + model, resolves the provider's credentials from local config, and constructs the ProviderRoute directly, bypassing alias lookup entirely.

The override is a session-level persistent setting, not per-message. It is set via PUT /api/v1/sessions/:id (see http-api.md) and can be cleared by setting it to null. The UI exposes a model picker in the session input area that reads from the operator's configured providers and their persisted model catalogs.

POST /api/v1/sessions/:id/messages also accepts an optional model field in the request body. When present, it is persisted to the session record before dispatch, becoming the session's override for this and all subsequent messages (until changed or cleared). This enables per-message model switching while maintaining the "session-level override" semantic — the last-used model is sticky.
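
The override handling can be sketched as two small helpers. The function names and the simplified `SessionRecord` are hypothetical, and splitting on the first colon is an assumption (the spec says only that the string is split into provider and model). Splitting on the first colon keeps any further colons inside the model part.

```typescript
interface ModelOverride {
  provider: string;
  model: string;
}

// Parses a "provider:model" string; returns null when either part is missing.
function parseModelOverride(value: string): ModelOverride | null {
  const idx = value.indexOf(":");
  if (idx <= 0 || idx === value.length - 1) return null;
  return { provider: value.slice(0, idx), model: value.slice(idx + 1) };
}

// Simplified record: only the field relevant to the override.
interface SessionRecord {
  model: string | null;
}

// A model on the message body is written to the session before dispatch,
// so it stays in effect for later messages until changed or cleared.
function applyMessageModel(session: SessionRecord, bodyModel: string | undefined): void {
  if (bodyModel !== undefined) session.model = bodyModel;
}
```

For example, with the made-up value `"acme:model-1"`, `parseModelOverride` yields provider `acme` and model `model-1`. A message posted without a `model` field leaves the session's current override unchanged.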

2026-05-27 — ADR-0024: compaction-pending checkpoints and `CompactionRecord`

Per ADR-0024:

Checkpoint created_by extended. New value "compaction" added alongside "operator" and "model". Distinguished by the harness's compaction pipeline when strategy: checkpoint is configured. New compaction_id field on Checkpoint links to the paired CompactionRecord.
reason: "compaction_pending" is a distinguished value indicating the checkpoint was created by a strategy: checkpoint compaction event awaiting operator review. The Compactor UI (per ui/compactor.md) surfaces these checkpoints differently.
New CompactionRecord shape defined. Persisted in the compactions SQLite table. Fields include trigger, strategy, threshold estimates, message effect (superseded IDs, summary message), plugin participation, cost, fallback tracking, operator feedback (operator_flag, operator_notes), and checkpoint_id linkage. Schema migration in @kaged/storage adds the table with indexes on session_id, run_id, agent_path.
Compaction lifecycle section added documenting how compaction integrates with the session manager: most strategies don't change session state, but strategy: checkpoint creates a paired checkpoint and pauses the session. Approve / reject flows specified.
Operator feedback persistence. PATCH /api/v1/sessions/:id/compactions/:cid (per http-api.md) writes operator_flag and operator_notes to the CompactionRecord. Used by the Compactor UI's history view.
Rollback interaction. Rolling back past a compaction unsupersedes the affected superseded_message_ids. The summary message (if any) is superseded by the rollback. The CompactionRecord is preserved with rolled_back: true (a new field).
Compaction limits added — 1000 CompactionRecord rows per session before aggregation; 60s timeout per compaction event with fallback to drop.
Constrained-by list extended with ADR-0024.

2026-06-05 — ADR-0034: session-issue binding, bind/unbind transitions, auto-conclude answer

Per ADR-0034:

bound_issue column added to the sessions table schema. Nullable FK to issues.id. Stored on SessionRecord as boundIssue: string | null.
Session-issue binding section added. Documents binding semantics: session-side pointer, not a state transition; binding/unbinding does not mutate todos; kaged.todo requires a binding.
Bind/unbind endpoints documented. PUT /api/v1/sessions/:id/bind and DELETE /api/v1/sessions/:id/bind.
Session creation with binding. POST /api/v1/projects/:slug/sessions accepts optional bound_issue field.
Auto-conclude answer. Ending a session never silently resolves its bound issue. The operator confirms at the closing checkpoint. Resolves the standing open question.
Constrained-by list extended with ADR-0034.

2026-06-05 — Concurrency throttle at message-send: `queued` state, system-wide running count, operator-explicit resume

The concurrency model is restructured: limits are checked at message-send time, not session-creation time, and exceeded messages queue rather than reject.

New session state: queued. When the operator posts a message and the running-session count (per-project or per-operator) is at the limit, the session enters queued instead of rejecting the message. The message is persisted, a run is created in pending, and no primary dispatch occurs.
Concurrency limits moved from creation to message-send. checkSessionCreationLimits is removed from session creation. Session creation always succeeds (project must be ready). The per-project (4) and per-operator (16) limits are checked in post_message instead.
Operator-explicit resume. The daemon does not auto-dequeue queued sessions when slots free. The operator calls POST /api/v1/sessions/:id/resume to start the pending run. This gives the operator control over which queued session gets a freed slot.
Queued message discard. The operator can discard a queued message via DELETE /api/v1/sessions/:id/queued-message, which cancels the pending run, marks the message superseded, and transitions the session back to idle.
System-wide running count broadcast. Every run start/end broadcasts sessions.running_count on the system WebSocket (per http-api.md § System WebSocket). The count includes per-project and per-operator breakdowns so the UI can show both the global count and whether a specific session is near its limit.
paused sessions release the slot. A checkpoint-paused session transitions running → paused and no longer counts against the concurrency limit. This is consistent with the existing checkpoint semantics (work is suspended) and means pausing a session can free a slot for a queued one.
State machine diagram, states table, transitions table, transition side effects, and audit events updated. Three new audit events: session.queued, session.resumed_from_queue, session.queued_discarded.
"What counts as running" clarified. Only running sessions count against the limit. idle, queued, paused, ended, and failed do not.

2026-06-05: Sticky todo reminder injection in primary runner

The primary runner's message pipeline now includes a sticky todo reminder injection point. After spend-limit enforcement and before the LLM call, the runner resolves the session's bound_issue, loads its todos, and conditionally appends an ephemeral system-role message listing open items. The message is not persisted, existing only for that single LLM call. Suppression logic prevents redundancy when a kaged.todo tool result is already in the preceding turn. See agent-tooling.md § kaged.todo for the full specification.

2026-07-01 — Continue-suppression at tool boundaries

Backs the UI auto-continue affordance (see ui/README.md § 2026-07-01). The UI always posts a literal { content: "continue" } when auto-continue fires; the daemon decides whether that operator turn is actually needed, so the transcript is not polluted with redundant "continue" turns when the previous run merely ran out of step budget mid-tool-loop.

Continuation signal. POST /api/v1/sessions/:id/messages accepts an optional boolean continuation flag on the request body (default false). The UI sets it true for auto-continue posts. When absent/false, this amendment does not apply and the message is handled exactly as before (a normal operator turn). The flag — not string-matching the content — is the trigger; an operator who literally types the word "continue" is unaffected.
Tool-boundary detection. On a continuation: true post, before persisting the operator message, the daemon inspects the tail of the session's active (non-superseded) message history — the same reconstruction used to build the next run's context. The previous run ended at a tool boundary iff the last active message is a toolResult, or is an assistant message carrying tool calls with no corresponding tool results after it (an unresolved tool-call tail). Otherwise the previous run ended on text (last active message is assistant text with no pending tool calls).
Suppression rule.
- Ended at a tool boundary → the daemon does not persist the "continue" operator message. It creates a new run and dispatches the primary on the existing reconstructed history (which already ends in a pending tool result / unresolved tool call — a sufficient signal for the model to proceed). The idle → running (or idle → queued under the concurrency throttle) transition is identical to a normal post; only the operator-message insert is skipped.
- Ended on text → the daemon persists "continue" as a normal operator message and dispatches as usual. Continuing from a text end requires a real instruction, so the turn is kept.
In both cases stopReason is not consulted: the decision is purely the message-tail shape. (A length stop can be either tail; a tool-boundary tail can also arise without a length stop.) This keeps the rule robust to the harness collapsing step- and token-limits into a single "length" (per agent.md).
No new session/run state. Continue-suppression is a compound operation over existing primitives (identical to how Regeneration reuses the post_message transition). It introduces no new state, no new transition, and no new audit event beyond the usual run.created. The suppressed-message case simply omits one messages row insert.
Runaway safety is upstream. The daemon never self-initiates a continuation — it only ever reacts to an operator (UI) continuation: true post. Chaining is driven entirely by the client toggle, which is ephemeral and browser-local (see the paired UI amendment). With no connected UI, no continue is posted, so no run chains. There is deliberately no daemon-side auto-continue loop.
Testing notes. (a) continuation: true with a toolResult tail → no operator messages row is written and a new run dispatches on existing history; (b) continuation: true with an unresolved-tool-call assistant tail → same suppression; (c) continuation: true with an assistant-text tail → "continue" is persisted as an operator message and dispatched; (d) continuation absent/false → message always persisted (baseline behavior); (e) suppression respects the concurrency throttle identically (tool-boundary continue while at the limit → queued with a pending run, no operator row); (f) an operator literally typing "continue" without the flag is never suppressed.
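
The tool-boundary check and suppression rule can be sketched as follows. The `ActiveMessage` union is a hypothetical simplification of the reconstructed history (the stored `messages` rows are shaped differently), and both function names are illustrative. The input is assumed to contain only active, non-superseded messages in order.

```typescript
// Hypothetical simplified view of the active message history.
type ActiveMessage =
  | { kind: "operator"; text: string }
  | { kind: "assistant"; text: string; toolCallIds: string[] }
  | { kind: "toolResult"; toolCallId: string };

// True when the tail is a tool result, or an assistant message that still
// carries tool calls (being last, nothing after it can have resolved them).
function endedAtToolBoundary(history: ActiveMessage[]): boolean {
  const last = history[history.length - 1];
  if (last === undefined) return false;
  if (last.kind === "toolResult") return true;
  return last.kind === "assistant" && last.toolCallIds.length > 0;
}

// Decides whether the posted operator message gets a `messages` row.
// Only the flag matters; the message content is never inspected.
function shouldPersistOperatorMessage(history: ActiveMessage[], continuation: boolean): boolean {
  if (!continuation) return true;        // normal operator turn
  return !endedAtToolBoundary(history);  // suppress only at a tool boundary
}
```

In either branch a new run is still created. The result only controls whether the "continue" row is inserted.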

References

ADR-0002 — web-first, sessions survive disconnects
ADR-0004 — Bun runtime, PTY integration
ADR-0005 — storage engine, schema portability
ADR-0007 — identity threading through sessions
ADR-0009 — cage mechanism for subagents within sessions
ADR-0010 — deployment modes, same session semantics
ADR-0024 — context compaction; CompactionRecord and compaction_pending checkpoint
agent.md § Compaction — the harness-side compaction pipeline this section integrates with
http-api.md — the API surface the session manager implements
daemon.md — process model, supervisor, operating limits
sandbox.md — cage spawn/teardown called by the session manager
plugin-host.md — plugins called during runs
project-dsl.md — DSL that defines session shape (subagents, cages, prompts)
local-config.md — alias resolution at session-start
