Spec: Session Manager
- Status: Draft
- Last amended: 2026-06-05 (concurrency throttle at message-send: `queued` state, system-wide running count, operator-explicit resume)
- Constrained by: ADR-0002, ADR-0004, ADR-0005, ADR-0007, ADR-0009, ADR-0010, ADR-0024, ADR-0034
- Implements: `packages/session-manager/` (planned)
Purpose
This spec defines the session manager: the daemon subsystem that owns the lifecycle of sessions — from creation through active work to checkpoint, detach, reattach, and termination. It is the binding layer between the operator's intent (via the HTTP/WS API) and the daemon's execution machinery (primary agent, subagent supervisor, plugin host).
This document is normative for:
- The session state machine — every state, every transition, every guard.
- The PTY broker — how terminal sessions are multiplexed, attached, detached, resized, and persisted.
- The checkpoint lifecycle — creation, inspection, resume, rollback, and the state they capture.
- The reattach semantics — what happens when the operator reconnects after a disconnect.
- The run model — what a "run" is, how runs relate to sessions, cancellation, and failure.
- The persistence contract — what is durably stored, when, and in what shape.
- The concurrency model — how many sessions, subagents, and runs can be active simultaneously.
It is not normative for:
- The HTTP/WS API shape (that's `http-api.md`; the session manager implements the behavior behind those endpoints).
- The sandbox or cage mechanism (that's `sandbox.md`).
- The primary agent's dispatch logic (the session manager starts runs; the primary decides what to do within a run).
- The plugin host (that's `plugin-host.md`).
- The project DSL format (that's `project-dsl.md`).
- The daemon's process-level lifecycle (that's `daemon.md`).
Constraints (from ADRs)
| Constraint | Source |
|---|---|
| Web-first; sessions survive operator disconnects | ADR-0002 |
| Runtime is Bun; PTY via node-pty or Bun's native PTY (if available) | ADR-0004 |
| Session state persisted to SQLite (or Postgres); durable across daemon restarts | ADR-0005 |
| Auth identity (user_id) threads through sessions for audit and multi-operator | ADR-0007 |
| Subagents spawned in cages; session manager delegates to sandbox subsystem | ADR-0009 |
| Both deployment modes use the same session semantics | ADR-0010 |
Core concepts
Session
A session is a persistent, named context within a project. It is the unit of "what is currently happening" — an ongoing conversation between the operator and the primary agent, potentially with subagents running work in cages.
Key properties:
- Project-scoped. Every session belongs to exactly one project.
- Operator-scoped. Every session has a `created_by` user_id. In per-user mode, this is always the single operator. In system-wide mode, sessions are visible only to the operator who created them (v0; shared sessions are a v2 concern).
- Survives disconnect. The operator closing their browser or losing network does not end the session. Work continues. The operator reattaches later.
- Survives daemon restart. Session state is persisted continuously. A daemon SIGKILL loses only uncommitted in-flight reasoning (marked as a failed run on next startup).
- Named with ULIDs. Session IDs are ULIDs (sortable, unique, no coordination needed). Displayed to the operator as the ULID string (e.g., `01HXAB...`).
Run
A run is a single unit of model work within a session: the operator sends a message, the primary processes it (possibly dispatching subagents), and the work completes (or fails, or is cancelled). A session contains an ordered list of runs.
Key properties:
- One active run per session at a time (v0). If a message arrives while a run is in progress, the API returns 409 `conflict` (the session is `running`; the conflict is per-session, not the concurrency throttle). The concurrency throttle operates across sessions: if the operator's or project's running-session count is at the limit, the message is accepted but the session enters `queued` with the run in `pending`.
- Runs have their own state (`pending`, `running`, `done`, `failed`, `cancelled`). A run in `pending` means the message is saved but the primary has not started, either because the session just queued it (concurrency throttle) or because the daemon is about to pick it up (the normal brief gap between creation and dispatch).
- Runs are the audit boundary. Every run records its input, output, subagent invocations, tool calls, and timing.
- Runs can be cancelled. Cancellation kills in-flight subagents and returns control to the operator.
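The difference between the per-session conflict and the cross-session throttle can be sketched as a small decision function. This is an illustration only: `admitMessage`, `Admission`, and the plain integer `limit` are hypothetical names and simplifications, and states other than `idle` and `running` are deliberately left out of scope.

```typescript
// Sketch only: type and function names are hypothetical, not part of the spec.
type SessionState = "idle" | "running" | "queued" | "paused" | "ended" | "failed";

type Admission =
  | { kind: "conflict"; status: 409 }          // a run is already active in this session
  | { kind: "queued"; runState: "pending" }    // accepted, waiting for a concurrency slot
  | { kind: "started"; runState: "running" };  // accepted, run begins

function admitMessage(
  session: SessionState,
  runningSessions: number, // running-session count for the operator or project
  limit: number,           // configured concurrency limit (assumed to be a plain integer)
): Admission {
  // Per-session rule: one active run at a time.
  if (session === "running") return { kind: "conflict", status: 409 };
  if (session !== "idle") {
    throw new Error(`posting while ${session} is outside this sketch`);
  }
  // Cross-session rule: the throttle queues instead of rejecting.
  if (runningSessions >= limit) return { kind: "queued", runState: "pending" };
  return { kind: "started", runState: "running" };
}
```

The point of the split is that a 409 tells the client to wait for its own run, while a queued result tells it the message is safely persisted and needs an explicit resume later.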
Checkpoint
A checkpoint is a first-class pause in a session. It captures a snapshot of the session's state at a moment in time, allowing the operator to inspect, edit prompts, and either resume or roll back.
Key properties:
- Operator-initiated (the operator hits ⏸ in the UI) or model-initiated (the primary or a subagent requests one via an explicit tool call).
- Inspectable. The checkpoint contains the message history up to that point, the active subagent states, the current tool calls, and the prompts in effect.
- Editable. At a checkpoint, the operator can edit system prompts before resuming.
- Reversible. The operator can roll back to a checkpoint, discarding work done after it.
Session state machine
create
│
┌────▼────┐
│ idle │◀──── run completes / resume with no pending work
└────┬────┘
│ operator posts message
│
┌────────┴────────┐
│ │
under concurrency at concurrency
limit limit
│ │
┌────▼────┐ ┌────▼────┐
│ running │ │ queued │
└──┬───┬──┘ └────┬────┘
│ │ │ operator resumes
│ │ │ (slot available)
│ │ ┌────▼────┐
│ │ │ running │
│ │ └─────────┘
│ │
run completes │ │ checkpoint (operator or model)
│ │
┌────▼┐ ┌▼────┐
│idle │ │paused│
└─────┘ └──┬──┘
│ resume / rollback+resume
┌────▼────┐
│ running │
└─────────┘
Any state except ended ──── operator ends session ──── ┌───────┐
│ ended │
└───────┘
Any state except ended ──── unrecoverable error ─────── ┌────────┐
│ failed │
└────────┘
States
| State | Description | Operator can… | Work happening? |
|---|---|---|---|
| `idle` | Session exists, no active work. Awaiting operator input. | Post a message, end the session, view history. | No. |
| `running` | Active run in progress. Primary and/or subagents are working. | Watch output, request a checkpoint, cancel the run, end the session. | Yes. |
| `queued` | Message accepted but run not started: concurrency limit is reached. The message is persisted; the run is `pending`. | Resume (when a slot frees), discard the queued message, end the session. | No. |
| `paused` | At a checkpoint. Work is suspended. | Inspect state, edit prompts, resume, rollback, end the session. | No (subagents suspended or exited). |
| `ended` | Terminal. Session is complete. No further transitions. | View history. | No. |
| `failed` | Terminal. Session hit an unrecoverable error. | View history, view the failure details. | No. |
Transitions
| From | To | Trigger | Guard |
|---|---|---|---|
| (none) | `idle` | `POST /api/v1/projects/:id/sessions` | Project state is ready. No concurrency limit at creation (see § Concurrency). |
| `idle` | `running` | `POST /api/v1/sessions/:id/messages` | No active run; concurrency limit not reached. |
| `idle` | `queued` | `POST /api/v1/sessions/:id/messages` | No active run; concurrency limit reached (per-project or per-operator). Message persisted, run created in `pending`. |
| `queued` | `running` | `POST /api/v1/sessions/:id/resume` | Concurrency slot available; pending run is started. |
| `queued` | `idle` | `DELETE /api/v1/sessions/:id/queued-message` | Operator discards the queued message. Pending run cancelled, message marked superseded. |
| `running` | `idle` | Run completes (success or failure of the run itself). | (none) |
| `running` | `paused` | Operator requests checkpoint or model requests checkpoint. | At least one active run. |
| `paused` | `running` | `POST /api/v1/sessions/:id/checkpoints/:cid/resume` | Checkpoint is the current pause point. |
| `paused` | `running` | `POST /api/v1/sessions/:id/checkpoints/:cid/rollback` + new message | Rollback succeeds; operator posts a new message. |
| `paused` | `idle` | `POST /api/v1/sessions/:id/checkpoints/:cid/resume` with no pending work | Resume determines nothing is left to do. |
| any except `ended` | `ended` | `DELETE /api/v1/sessions/:id?confirm=true` | (none) |
| any except `ended` | `failed` | Unrecoverable internal error (storage failure, daemon bug). | (none) |
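One way to keep the transition table enforceable in code is a lookup keyed by the source state. The sketch below is illustrative: `ALLOWED` and `canTransition` are hypothetical names, guards are not modelled, and it assumes `failed` is terminal as the States table says (read literally, the "any except ended" rows would also permit `failed` → `ended`).

```typescript
// Sketch only: names are hypothetical; guards from the table are not modelled here.
type SessionState = "idle" | "running" | "queued" | "paused" | "ended" | "failed";

// Target states reachable from each source state, taken from the transition table.
// Assumption: `failed` is treated as terminal, matching the States table.
const ALLOWED: Record<SessionState, readonly SessionState[]> = {
  idle: ["running", "queued", "ended", "failed"],
  running: ["idle", "paused", "ended", "failed"],
  queued: ["running", "idle", "ended", "failed"],
  paused: ["running", "idle", "ended", "failed"],
  ended: [],
  failed: [],
};

function canTransition(from: SessionState, to: SessionState): boolean {
  return ALLOWED[from].includes(to);
}
```

A check like `canTransition("queued", "paused")` returns `false`, since a queued session has no active run to checkpoint.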
Transition side effects
| Transition | Side effects |
|---|---|
| → `idle` (from `running`) | Persist run result. Flush WebSocket buffers. Emit `session.state` event. Broadcast `sessions.running_count` on system socket. |
| → `idle` (from `queued`) | Cancel pending run (mark `cancelled`). Mark queued message superseded. Emit `session.state` event. |
| → `running` (from `idle`) | Create a new Run record. Start primary agent processing. Emit `session.state` event. Broadcast `sessions.running_count` on system socket. |
| → `running` (from `queued`) | Start the pending run (transition run `pending` → `running`). Start primary agent processing. Emit `session.state` event. Broadcast `sessions.running_count` on system socket. |
| → `queued` | Persist operator message. Create Run record in `pending` state. Emit `session.state` event with `queued: true` and `reason: "concurrency_limit"`. |
| → `paused` | Suspend in-flight subagents (send `SIGTSTP` to caged processes). Capture checkpoint snapshot. Emit `session.state` event. |
| → `ended` | Kill all live subagents (`SIGTERM` → `SIGKILL`). Persist final state. Close WebSocket with `closing { reason: "session_ended" }`. Emit `session.state` event. Broadcast `sessions.running_count` on system socket (if was `running`). |
| → `failed` | Kill all live subagents. Persist failure details. Close WebSocket. Emit `session.state` event. Broadcast `sessions.running_count` on system socket (if was `running`). |
Run lifecycle
Run states
create (message posted)
│
┌────▼─────┐
│ pending │──── primary picks it up ────▶┌──────────┐
└──────────┘ │ running │
└──┬──┬──┬─┘
│ │ │
completes normally │ │ │ operator cancels
│ │ │
┌─────▼┐ │ ┌▼──────────┐
│done │ │ │ cancelled │
└──────┘ │ └───────────┘
│
error │
│
┌──────▼─┐
│ failed │
└────────┘
| State | Description |
|---|---|
| `pending` | Run created, message queued. Primary has not started processing. |
| `running` | Primary is actively processing. Subagents may be spawned. |
| `done` | Run completed successfully. Output is available. |
| `failed` | Run encountered an error (LLM provider unreachable, subagent cage failure, etc.). |
| `cancelled` | Operator cancelled the run mid-flight. In-flight subagents killed. |
Run record
Each run persists:
interface Run {
id: string; // ULID
session_id: string;
created_at: number; // epoch ms
completed_at: number | null;
state: "pending" | "running" | "done" | "failed" | "cancelled";
// Input
operator_message: Message; // the message that triggered this run
// Output
primary_response: Message | null;
subagent_invocations: SubagentInvocation[];
tool_calls: ToolCall[];
// Failure
error: RunError | null; // present when state is "failed"
// Timing
duration_ms: number | null;
tokens_in: number | null;
tokens_out: number | null;
}
Cancellation
When the operator cancels a run (POST /api/v1/sessions/:id/runs/:rid/cancel):
- The session manager marks the run `cancelled`.
- Sends `SIGTERM` to all subagents spawned by this run.
- Waits 5s for subagents to exit. `SIGKILL` survivors.
- Cage cleanup (per `sandbox.md` teardown).
- The primary's in-flight LLM request is aborted (the HTTP connection to the provider is closed).
- The session transitions from `running` → `idle`.
- The partial output from the run is preserved (marked as cancelled, not deleted).
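The ordering of the cancellation steps can be illustrated with the side effects injected as callbacks. Everything here is a hypothetical sketch: `CancelDeps`, `cancelRun`, and `aliveAfter` are invented for the example, and waiting is shown synchronously for brevity where a real daemon would await it.

```typescript
// Sketch only: the dependency interface is hypothetical; a real daemon would be async.
type Signal = "SIGTERM" | "SIGKILL";

interface CancelDeps {
  signal(subagentId: string, sig: Signal): void;         // deliver a signal to a caged process
  aliveAfter(ids: string[], graceMs: number): string[];  // wait, then report survivors
  teardownCage(subagentId: string): void;                // cage cleanup (per sandbox.md)
  abortLlmRequest(): void;                               // close the provider connection
}

const GRACE_MS = 5_000; // the 5s wait named in the steps above

function cancelRun(subagentIds: string[], deps: CancelDeps): { run: string; session: string } {
  for (const id of subagentIds) deps.signal(id, "SIGTERM");
  const survivors = deps.aliveAfter(subagentIds, GRACE_MS);
  for (const id of survivors) deps.signal(id, "SIGKILL");
  for (const id of subagentIds) deps.teardownCage(id);
  deps.abortLlmRequest();
  // Partial output is kept; only the run and session states change.
  return { run: "cancelled", session: "idle" };
}
```

Injecting the effects keeps the escalation order testable without spawning real processes.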
PTY broker
The PTY broker multiplexes terminal sessions over WebSocket. When a subagent runs a command that needs a terminal (shell, TUI, interactive tool), the broker allocates a PTY, connects it to the subagent's process inside the cage, and streams I/O to the operator's browser.
Architecture
operator's browser
│
│ WebSocket (pty channel)
│
┌─────▼──────┐
│ PTY broker │──── manages ────▶ PTY instances (one per subagent terminal)
└─────┬──────┘
│ pty fd
│
┌─────▼──────────┐
│ subagent cage │
│ └── process │
│ (shell) │
└────────────────┘
PTY allocation
PTYs are allocated when:
- A subagent's DSL configuration includes `terminal: true` (explicit).
- The primary dispatches a subagent with a tool call that requests terminal access.
- The operator explicitly requests a terminal for a running subagent via the UI.
PTY allocation uses node-pty (or Bun's native PTY API if/when available). The master side stays with the broker; the slave side is passed into the cage via file descriptor inheritance.
PTY lifecycle
allocate ──▶ active ──▶ detached ──▶ reattached ──▶ active
│ │
│ │
▼ ▼
closed (subagent exits) closed
| State | Description |
|---|---|
| `active` | Operator is connected; I/O flows both ways. |
| `detached` | Operator disconnected; PTY is alive; output buffered (scrollback). |
| `closed` | Subagent exited or session ended. PTY fd closed. Transcript persisted. |
I/O flow
Operator → PTY (input):
- WebSocket `pty` channel receives input frame: `{ type: "stdin", pty_id: "...", data: "<base64>" }`.
- Broker writes decoded bytes to the PTY master fd.
- The shell (inside the cage) reads from the slave fd.
PTY → Operator (output):
- Broker reads from the PTY master fd.
- Frames output: `{ type: "stdout", pty_id: "...", data: "<base64>", seq: N }`.
- Sends over the WebSocket `pty` channel.
- If the operator is disconnected (detached), output is buffered in the scrollback ring.
Resize
The operator's browser sends resize events:
{ "type": "resize", "pty_id": "...", "cols": 120, "rows": 40 }
The broker calls pty.resize(cols, rows), which sends SIGWINCH to the process group inside the cage.
Scrollback and transcript
Each PTY maintains:
- Scrollback ring buffer: configurable, default 10,000 lines. Stored in memory while the PTY is active. On PTY close, the full scrollback is persisted to the database as the run's terminal transcript.
- Transcript persistence: the daemon stores the transcript as a blob in the `run_pty_transcripts` table, keyed by `(run_id, pty_id)`. The transcript includes raw bytes (for faithful replay) and a timestamp per chunk (for timed replay).
On reattach, the operator receives the scrollback from the buffer (not from the database — the database is for post-session replay).
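A bounded scrollback of this kind is commonly built as a ring buffer that overwrites its oldest entry once full. The class below is a minimal sketch under that assumption; `ScrollbackRing` and its methods are hypothetical names, and it stores whole lines as strings rather than raw bytes.

```typescript
// Sketch only: a line-oriented ring buffer; the real broker may store raw byte chunks.
class ScrollbackRing {
  private readonly capacity: number;
  private lines: string[] = [];
  private start = 0; // index of the oldest line once the ring is full

  constructor(capacity = 10_000) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  push(line: string): void {
    if (this.lines.length < this.capacity) {
      this.lines.push(line);
    } else {
      this.lines[this.start] = line; // overwrite the oldest line
      this.start = (this.start + 1) % this.capacity;
    }
  }

  // Oldest-first copy, suitable for sending as a reattach burst.
  snapshot(): string[] {
    return [...this.lines.slice(this.start), ...this.lines.slice(0, this.start)];
  }
}
```

With a capacity of 3, pushing five lines leaves only the last three in `snapshot()`, which is why reattach shows the current scrollback rather than every byte since disconnect.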
PTY limits
| Limit | Default | Enforced by |
|---|---|---|
| Max concurrent PTYs per session | 8 | Session manager (matches subagent limit per daemon.md) |
| Scrollback ring size | 10,000 lines | PTY broker |
| Max PTY output rate | 1 MiB/sec to WebSocket | PTY broker (drops frames with pty.output_throttled warning) |
| PTY idle timeout | 30 minutes (no I/O) | PTY broker (sends notification to operator; does not auto-close) |
Checkpoint lifecycle
Checkpoint creation
Operator-initiated:
- Operator clicks ⏸ or sends `POST /api/v1/sessions/:id/checkpoints`.
- Session manager transitions session to `paused`.
- Suspends in-flight subagents (`SIGTSTP` to caged processes; they can be resumed).
- Captures the checkpoint snapshot (see below).
- Returns the checkpoint ID to the operator.
Model-initiated:
- The primary (or a subagent) calls the `checkpoint` tool: `{ "tool": "checkpoint", "reason": "Need clarification on deployment target" }`.
- Session manager transitions session to `paused`.
- Same suspension and snapshot as operator-initiated.
- The checkpoint includes the model's stated `reason`, surfaced in the UI.
Compaction-initiated (reason: compaction_pending), per ADR-0024:
- The harness compaction pipeline runs the configured strategy. When `strategy: checkpoint`, the harness creates a checkpoint with `reason: "compaction_pending"` and ends the current run with `finishReason: "awaiting_compaction"`.
- Session manager transitions session to `paused`.
- The checkpoint snapshot is captured normally plus a sibling `CompactionRecord` (see § CompactionRecord below) describing the proposed compaction.
- The Compactor UI (per `ui/compactor.md`) inspects the proposed compaction; the operator edits/approves/rejects.
- On approve, the operator-approved compaction is applied and the session resumes via the normal checkpoint resume path.
- On reject, the checkpoint is rolled back and the session continues without compaction (the operator must then either manually compact with a different strategy or manage the context-overflow themselves).
The reason: compaction_pending is a distinguished value; the UI surfaces it differently from operator/model-initiated checkpoints. The API endpoint for inspecting a compaction-pending checkpoint composes the checkpoint + the proposed CompactionRecord in one response (per http-api.md § Compaction).
Checkpoint snapshot
A checkpoint captures:
interface Checkpoint {
id: string; // ULID
session_id: string;
run_id: string; // the run that was paused
created_at: number; // epoch ms
created_by: "operator" | "model" | "compaction"; // ADR-0024 adds "compaction"
reason: string | null; // model's stated reason, operator's label, or "compaction_pending"
compaction_id: string | null; // when created_by == "compaction", references CompactionRecord
// State snapshot
message_history_cursor: string; // pointer to the last message at pause time
active_subagents: SubagentSnapshot[];
pending_tool_calls: ToolCall[]; // tool calls that were in-flight
prompts_in_effect: PromptSnapshot[];
// Metadata
resumed_at: number | null;
rolled_back: boolean;
superseded_by: string | null; // another checkpoint that replaced this one
}
interface SubagentSnapshot {
name: string;
state: "suspended" | "exited"; // suspended = SIGTSTP'd; exited = finished before pause
cage_id: string | null;
last_output: string; // last 500 chars of output, for UI preview
pty_id: string | null;
}
interface PromptSnapshot {
target: "primary" | string; // "primary" or subagent name
content_hash: string; // hash of the prompt content at pause time
editable: boolean; // always true in v0
}
Checkpoint inspection
At a checkpoint, the operator can:
- View the message history up to the pause point.
- View active subagent state — which subagents were running, their last output, their cage status.
- View pending tool calls — what the model was about to do.
- Edit system prompts — change the primary's or any subagent's system prompt. Edits are captured in the audit log.
- View the model's reason (if model-initiated) — "I need clarification on..."
Resume
POST /api/v1/sessions/:id/checkpoints/:cid/resume
- Validate the checkpoint is the current pause point (not superseded).
- Apply any prompt edits the operator made.
- Resume suspended subagents (`SIGCONT` to caged processes).
- Transition session to `running`.
- The primary continues from where it paused, with the (possibly edited) prompts.
If the operator edited prompts, the primary receives a system-level message: "System prompts were updated at this checkpoint. Previous prompt hash: X, new prompt hash: Y." This lets the model adjust its approach if the operator changed direction.
Rollback
POST /api/v1/sessions/:id/checkpoints/:cid/rollback
- Validate the checkpoint exists in this session.
- Kill all subagents spawned after the checkpoint (they're working with post-checkpoint context).
- Mark all messages after the checkpoint as `superseded: true` (they remain in history for auditability, but the UI greys them out).
- Transition session to `idle` (the operator can now post a new message that starts a fresh run from the checkpoint's state).
- The next run uses the checkpoint's prompt snapshots as its starting prompts.
Rollback does not delete data. The full history is preserved. Rollback is a pointer operation: it changes where the "current state" points to.
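The "pointer operation" framing can be shown on an in-memory message list: nothing is removed, and only the `superseded` flag changes for messages after the checkpoint's cursor. This is a sketch with hypothetical names (`Msg`, `rollbackTo`); it assumes messages are ordered oldest-first and that the cursor is a message ID.

```typescript
// Sketch only: an in-memory stand-in for the messages table.
interface Msg { id: string; superseded: boolean }

// `messages` is ordered oldest-first; `cursorId` plays the role of message_history_cursor.
function rollbackTo(messages: Msg[], cursorId: string): Msg[] {
  const cut = messages.findIndex((m) => m.id === cursorId);
  if (cut === -1) throw new Error("checkpoint cursor not found in this session");
  // Keep every row; flag only those after the cursor.
  return messages.map((m, i) => (i > cut ? { ...m, superseded: true } : m));
}
```

Because the list length is unchanged after `rollbackTo`, the audit trail stays complete while the UI can grey out the flagged rows.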
Checkpoint limits
| Limit | Default | Rationale |
|---|---|---|
| Max checkpoints per session | 50 | Prevent unbounded storage growth. Oldest auto-archived (queryable, not shown in active list). |
| Checkpoint snapshot size | ~100 KB typical | Message cursor is a pointer, not a copy. Subagent snapshots are summaries. |
Compaction lifecycle
Per ADR-0024, context compaction is owned by the harness (per agent.md § Compaction) and operates on the message-reconstruction path. The session manager's role is to persist CompactionRecord rows, transition the session state during strategy: checkpoint events, and surface compaction events to the operator.
CompactionRecord
A CompactionRecord is written for every compaction event (whether triggered automatically by threshold-crossing, by reactive-fallback after a context-length error, or by manual operator trigger).
interface CompactionRecord {
id: string; // ULID
session_id: string;
run_id: string; // the run whose pre-call check triggered, or the run created by operator_manual
agent_path: string; // which agent's window (per ADR-0022 canonical path)
created_at: number; // epoch ms
// Trigger and strategy
trigger: "threshold_crossed" | "provider_overflow_retry" | "operator_manual" | "scheduled";
strategy: "drop" | "summarize" | "delegate" | "checkpoint";
// Estimates
threshold_estimate: number; // fraction of context window at trigger time (e.g. 0.91)
after_estimate: number; // fraction after compaction (e.g. 0.55)
window_upper: number; // configured upper threshold (e.g. 0.85)
window_lower: number; // configured lower threshold (e.g. 0.60)
// Message effect
superseded_message_ids: string[]; // marked superseded by this event
summary_message_id: string | null; // new MessageRecord, if summarize/delegate produced one
summary: string | null; // human-readable summary (audit log + UI)
// Plugin participation
plugins_fired: Array<{
name: string;
role: "observer" | "compactor";
duration_ms: number;
result_kind: "inject" | "retain" | "compactor_result" | "null" | "error";
}>;
// Cost
plugin_cost: {
provider: string;
model: string;
input_tokens: number;
output_tokens: number;
cost_usd: number;
} | null; // null when no model invoked (drop, observer-only delegate result)
// Failure tracking
fallback_occurred: boolean; // true if delegate/summarize fell back to drop
fallback_reason: string | null; // human-readable reason
// Operator feedback (per ADR-0024)
operator_flag: "good" | "bad" | "neutral" | null;
operator_notes: string | null;
// Linkage
checkpoint_id: string | null; // when strategy == "checkpoint", links the paired Checkpoint
}
Storage
CompactionRecord rows live in the compactions table in kaged's SQLite (per ADR-0005) alongside checkpoints, messages, and sessions. Schema migration is in @kaged/storage; the migration adds the compactions table with indexes on session_id, run_id, and agent_path.
Lifecycle integration
- Most strategies (drop / summarize / delegate): the harness writes the `CompactionRecord` after the strategy completes and before the next LLM call. The session state does NOT change; the run continues with the compacted message list.
- `strategy: checkpoint`: the harness writes the `CompactionRecord` with a placeholder `summary` (the proposed change), creates a paired `Checkpoint` with `reason: "compaction_pending"` and `compaction_id: <CompactionRecord.id>`, and ends the current run. The session transitions to `paused` per the checkpoint lifecycle. On checkpoint resume:
  - Approve → the operator-approved compaction is applied (the `CompactionRecord` is finalized with the actual `superseded_message_ids` and `summary`), the checkpoint marks `resumed_at`, the session transitions to `running`, and a new run is started with the compacted list.
  - Reject → the `CompactionRecord` is marked `fallback_occurred: true, fallback_reason: "operator_rejected"`, the checkpoint is rolled back (existing rollback semantics), and the session transitions to `idle`. No messages are superseded. The operator must now either manually compact with a different strategy (per `http-api.md`) or manage the context-overflow themselves.
Operator feedback
The operator can attach a flag (good / bad / neutral) and free-text notes to any CompactionRecord after the fact via PATCH /api/v1/sessions/:id/compactions/:cid (per http-api.md). This persists on the CompactionRecord and surfaces in the Compactor UI's history view. Plugin authors iterating on a compactor plugin use these flags as a feedback signal.
Interaction with rollback
If the operator rolls back past a compaction point (rollback target is a checkpoint that predates the compaction):
- The standard rollback path runs (per § Rollback).
- Messages marked `superseded = true` by the compaction (its `superseded_message_ids`) are unsuperseded: they return to active state in the message list.
- Any `summary_message_id` created by the compaction is superseded by the rollback (it was a post-compaction artifact, not present at the rollback target).
- The `CompactionRecord` itself is NOT deleted; it remains in the audit log marked `rolled_back: true` (a new field on `CompactionRecord`).
This is the same machinery as message rollback per ADR-0022 / session-manager; compaction integrates with no new flag.
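The compaction-specific part of that rollback amounts to flipping flags in both directions. The sketch below covers only those flag changes; `undoCompaction` and `CompactionLite` are hypothetical names, and how this composes with the standard rollback of post-checkpoint messages is not modelled.

```typescript
// Sketch only: covers the compaction-specific flag changes, not the standard rollback path.
interface Msg { id: string; superseded: boolean }

interface CompactionLite {
  superseded_message_ids: string[];
  summary_message_id: string | null;
  rolled_back: boolean;
}

function undoCompaction(
  messages: Msg[],
  compaction: CompactionLite,
): { messages: Msg[]; compaction: CompactionLite } {
  const restore = new Set(compaction.superseded_message_ids);
  const next = messages.map((m) => {
    if (restore.has(m.id)) return { ...m, superseded: false };               // back to active
    if (m.id === compaction.summary_message_id) return { ...m, superseded: true }; // artifact hidden
    return m;
  });
  // The record is kept for audit and only flagged.
  return { messages: next, compaction: { ...compaction, rolled_back: true } };
}
```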
Compaction limits
| Limit | Default | Rationale |
|---|---|---|
| Max `CompactionRecord` rows per session | 1000 | Long sessions can accrue many compactions; old rows are summarized before being purged. When the limit is hit, the oldest 100 are aggregated into a single "compactions before T" summary row. |
| Compaction event timeout | 60s | A strategy that takes longer than 60s is treated as failed and falls back to drop. |
Reattach semantics
Sessions survive operator disconnects. The reattach flow is designed for the common case: "operator's phone dropped Wi-Fi mid-task."
Disconnect detection
- WebSocket close frame received — clean disconnect. Immediate.
- WebSocket ping/pong timeout — unclean disconnect (network loss). Detected within 30s (ping interval is 15s, 2 missed pongs = disconnect).
- HTTP idle timeout — operator hasn't made any request in 10 minutes. Not a disconnect per se; the session remains attached but the buffer deadline starts.
During disconnect
The daemon:
- Keeps the session and all its in-flight work alive. Work does not stop because the operator disconnected.
- Buffers WebSocket frames that would have been sent to the operator:
  - Output channel: buffered up to 10 minutes or 50 MiB (whichever comes first).
  - Events channel: buffered with the same limits.
  - PTY channel: NOT buffered for replay (PTY scrollback is in the PTY ring buffer; see PTY broker).
  - Control channel: not buffered (control messages are request/response, not streaming).
- Records the disconnect in the audit log: `session.detached { session_id, reason, user_id }`.
Reattach
- Operator opens a new WebSocket to `/api/v1/sessions/:id/socket`.
- Sends `control { type: "hello", payload: { resume_from_seq: { output: N, events: M } } }`.
- Daemon checks the buffer:
  - Buffer has the requested sequences: replays missed frames from `N+1` onward for output, `M+1` for events. Sends `welcome { session_id, server_seq: { ... } }` first.
  - Buffer gap (too old or overflowed): sends `closing { code: "resume_failed" }`. Client must do a full re-fetch via `GET /api/v1/sessions/:id/messages?since=<last-seen-message-id>`.
- PTY reattach: the broker sends the current scrollback content for any active PTYs as a burst of `pty.stdout` frames. This is the PTY ring buffer content, not a replay of every byte since disconnect.
- Session is now `attached` again. Normal streaming resumes.
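The buffer check for a single channel can be sketched as a pure function over the buffered frames. The names (`Frame`, `resume`, `ResumeResult`) are hypothetical, and the sketch assumes sequence numbers are contiguous integers and that the buffer is kept in ascending order.

```typescript
// Sketch only: one channel's replay decision; names and shapes are hypothetical.
interface Frame { seq: number; payload: string }

type ResumeResult =
  | { kind: "replay"; frames: Frame[] }
  | { kind: "resume_failed" };

// `buffer` is in ascending seq order; `lastSeen` is the client's resume_from_seq value;
// `serverSeq` is the highest sequence number the daemon has emitted on this channel.
function resume(buffer: Frame[], lastSeen: number, serverSeq: number): ResumeResult {
  if (lastSeen >= serverSeq) return { kind: "replay", frames: [] }; // nothing was missed
  const first = buffer[0];
  const oldest = first ? first.seq : serverSeq + 1;
  // If the oldest buffered frame is newer than lastSeen + 1, frames were evicted.
  if (oldest > lastSeen + 1) return { kind: "resume_failed" };
  return { kind: "replay", frames: buffer.filter((f) => f.seq > lastSeen) };
}
```

A `resume_failed` result corresponds to the gap case, where the client falls back to the full re-fetch.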
Multi-device
v0 allows one operator connection per session at a time (per http-api.md). A second WebSocket upgrade attempt while attached returns 409 conflict.
To move between devices:
- Close the WebSocket on the old device (or wait for idle timeout).
- Open on the new device.
v1 may add read-only observer connections (second device can watch but not interact). Not v0.
Persistence
What is persisted and when
| Data | Persisted when | Storage |
|---|---|---|
| Session record (id, project, state, created_by, timestamps) | On creation, on every state transition | sessions table |
| Messages (operator and model) | Immediately on receipt/generation | messages table |
| Run records | On creation, on completion/failure/cancel | runs table |
| Subagent invocations | On spawn, on exit | subagent_invocations table |
| Tool calls and results | On call, on result | tool_calls table |
| Checkpoints | On creation | checkpoints table |
| PTY transcripts | On PTY close | run_pty_transcripts table |
| Prompt edits (at checkpoints) | On edit | prompt_edits table |
Persistence guarantees
- No data loss on clean shutdown. The daemon flushes all pending writes before exiting.
- Minimal data loss on SIGKILL. SQLite WAL mode ensures committed transactions survive. The only loss is the current in-flight LLM response (which is streamed and not yet committed). On next startup, the session manager:
  - Scans for sessions in `running` state.
  - Checks if the corresponding run has a committed response.
  - If not, marks the run as `failed` with `error: "daemon_crash_during_run"`.
  - Transitions the session to `idle`.
  - Logs `session.crash_recovered` in the audit log.
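The startup recovery scan can be sketched over in-memory rows. This is illustrative only: `recoverAfterCrash` and the `has_committed_response` flag are hypothetical, and the sketch assumes the session returns to `idle` whether or not the run had a committed response.

```typescript
// Sketch only: in-memory rows stand in for the sessions and runs tables.
interface SessionRow { id: string; state: string }
interface RunRow {
  id: string;
  session_id: string;
  state: string;
  has_committed_response: boolean; // hypothetical flag for "response was committed"
  error: string | null;
}

// Mutates the rows in place and returns the audit entries it would log.
function recoverAfterCrash(sessions: SessionRow[], runs: RunRow[]): string[] {
  const audit: string[] = [];
  for (const s of sessions) {
    if (s.state !== "running") continue;
    const run = runs.find((r) => r.session_id === s.id && r.state === "running");
    if (run && !run.has_committed_response) {
      run.state = "failed";
      run.error = "daemon_crash_during_run";
    }
    s.state = "idle"; // assumption: applies even if the response was committed
    audit.push(`session.crash_recovered ${s.id}`);
  }
  return audit;
}
```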
Schema sketch
CREATE TABLE sessions (
id TEXT PRIMARY KEY, -- ULID
project_id TEXT NOT NULL,
created_by TEXT NOT NULL, -- user_id
state TEXT NOT NULL, -- idle, running, queued, paused, ended, failed
created_at INTEGER NOT NULL, -- epoch ms
updated_at INTEGER NOT NULL,
ended_at INTEGER,
forked_from TEXT, -- session ID if this was forked
model_override TEXT, -- "provider:model" — overrides DSL primary.model for this session
bound_issue TEXT, -- ADR-0034: FK to issues.id — session's bound issue (nullable)
FOREIGN KEY (project_id) REFERENCES projects(id)
);
CREATE TABLE messages (
id TEXT PRIMARY KEY, -- ULID
session_id TEXT NOT NULL,
run_id TEXT, -- null for the initial operator message before a run starts
role TEXT NOT NULL, -- operator, primary, subagent, system
content TEXT NOT NULL,
created_at INTEGER NOT NULL,
superseded INTEGER DEFAULT 0, -- 1 if rolled back past
metadata TEXT, -- JSON: token counts, model used, etc.
FOREIGN KEY (session_id) REFERENCES sessions(id)
);
CREATE TABLE runs (
id TEXT PRIMARY KEY, -- ULID
session_id TEXT NOT NULL,
state TEXT NOT NULL, -- pending, running, done, failed, cancelled
created_at INTEGER NOT NULL,
completed_at INTEGER,
duration_ms INTEGER,
tokens_in INTEGER,
tokens_out INTEGER,
error TEXT, -- JSON: RunError, null if no error
FOREIGN KEY (session_id) REFERENCES sessions(id)
);
CREATE TABLE subagent_invocations (
id TEXT PRIMARY KEY, -- ULID
run_id TEXT NOT NULL,
session_id TEXT NOT NULL,
subagent_name TEXT NOT NULL,
cage_id TEXT, -- null if uncaged
state TEXT NOT NULL, -- spawned, running, done, failed, killed
spawned_at INTEGER NOT NULL,
exited_at INTEGER,
exit_code INTEGER,
FOREIGN KEY (run_id) REFERENCES runs(id)
);
CREATE TABLE tool_calls (
id TEXT PRIMARY KEY, -- ULID
run_id TEXT NOT NULL,
session_id TEXT NOT NULL,
caller TEXT NOT NULL, -- "primary" or subagent name
tool_name TEXT NOT NULL,
input TEXT NOT NULL, -- JSON
output TEXT, -- JSON, null if pending
state TEXT NOT NULL, -- pending, done, failed
created_at INTEGER NOT NULL,
completed_at INTEGER,
FOREIGN KEY (run_id) REFERENCES runs(id)
);
CREATE TABLE checkpoints (
id TEXT PRIMARY KEY, -- ULID
session_id TEXT NOT NULL,
run_id TEXT NOT NULL,
created_at INTEGER NOT NULL,
created_by TEXT NOT NULL, -- "operator", "model", or "compaction" (ADR-0024)
reason TEXT,
snapshot TEXT NOT NULL, -- JSON: CheckpointSnapshot
resumed_at INTEGER,
rolled_back INTEGER DEFAULT 0,
superseded_by TEXT,
FOREIGN KEY (session_id) REFERENCES sessions(id)
);
CREATE TABLE run_pty_transcripts (
id TEXT PRIMARY KEY,
run_id TEXT NOT NULL,
pty_id TEXT NOT NULL,
transcript BLOB NOT NULL, -- raw bytes with timing markers
line_count INTEGER NOT NULL,
created_at INTEGER NOT NULL,
FOREIGN KEY (run_id) REFERENCES runs(id)
);
CREATE TABLE prompt_edits (
id TEXT PRIMARY KEY, -- ULID
checkpoint_id TEXT NOT NULL,
session_id TEXT NOT NULL,
target TEXT NOT NULL, -- "primary" or subagent name
old_hash TEXT NOT NULL,
new_hash TEXT NOT NULL,
new_content TEXT NOT NULL, -- the edited prompt text
edited_at INTEGER NOT NULL,
FOREIGN KEY (checkpoint_id) REFERENCES checkpoints(id)
);
-- Indexes
CREATE INDEX idx_messages_session ON messages(session_id, created_at);
CREATE INDEX idx_runs_session ON runs(session_id, created_at);
CREATE INDEX idx_subagent_invocations_run ON subagent_invocations(run_id);
CREATE INDEX idx_tool_calls_run ON tool_calls(run_id);
CREATE INDEX idx_checkpoints_session ON checkpoints(session_id, created_at);
All tables use TEXT primary keys (ULIDs). Timestamps are integer epoch milliseconds (per ADR-0005 implementation notes). JSON columns use TEXT with application-level validation.
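
The conventions above (integer epoch milliseconds, JSON stored as TEXT, validation in the application) can be illustrated with a minimal decoding sketch for a tool_calls row. The column names come from the schema; the ToolCall shape and the decodeToolCall helper are hypothetical names introduced only for this example.

```typescript
// Row shape as stored in the tool_calls table (column names from the schema).
interface ToolCallRow {
  id: string;
  run_id: string;
  session_id: string;
  caller: string;
  tool_name: string;
  input: string; // JSON text
  output: string | null; // JSON text, null while pending
  state: string;
  created_at: number; // epoch milliseconds
  completed_at: number | null;
}

type ToolCallState = "pending" | "done" | "failed";

// Hypothetical decoded form used by application code.
interface ToolCall {
  id: string;
  caller: string;
  toolName: string;
  input: unknown;
  output: unknown;
  state: ToolCallState;
  createdAt: Date;
  completedAt: Date | null;
}

function isToolCallState(s: string): s is ToolCallState {
  return s === "pending" || s === "done" || s === "failed";
}

// Validation happens here because the database only sees TEXT and INTEGER.
function decodeToolCall(row: ToolCallRow): ToolCall {
  if (!isToolCallState(row.state)) {
    throw new Error(`tool_calls.state has unexpected value: ${row.state}`);
  }
  return {
    id: row.id,
    caller: row.caller,
    toolName: row.tool_name,
    input: JSON.parse(row.input), // throws SyntaxError on malformed JSON
    output: row.output === null ? null : JSON.parse(row.output),
    state: row.state,
    createdAt: new Date(row.created_at),
    completedAt: row.completed_at === null ? null : new Date(row.completed_at),
  };
}
```

A real implementation would likely validate the parsed JSON against a per-tool schema as well; this sketch only checks the state value and that the JSON parses.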
Session forking
An operator can create a new session forked from an existing one:
POST /api/v1/projects/music/sessions
{
"resume_from": "01HXAB..."
}
Forking:
- Creates a new session with forked_from = <source_session_id>.
- Copies the message history from the source session up to the current point (or up to a specified checkpoint).
- Does NOT share state: the fork is a snapshot copy. Changes to the original don't affect the fork.
- The new session starts in the idle state.
Forking is relatively expensive because it copies the message history, so its cost grows with the length of that history. It is not a branching model like git: there is no merge. Forks are independent sessions that happen to share an origin.
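
A minimal in-memory sketch of the snapshot-copy behaviour described in this section. The Message and ForkedSession shapes, the newId parameter (standing in for a ULID generator), and the cutoffMs parameter (standing in for "up to a specified checkpoint") are assumptions made for illustration, not the storage layer's actual API.

```typescript
interface Message {
  id: string;
  sessionId: string;
  createdAt: number; // epoch milliseconds
  content: string;
}

interface ForkedSession {
  id: string;
  forkedFrom: string;
  state: "idle";
  messages: Message[];
}

// newId is a hypothetical id generator; cutoffMs defaults to "the current point".
function forkSession(
  sourceId: string,
  sourceMessages: Message[],
  newId: () => string,
  cutoffMs: number = Number.POSITIVE_INFINITY,
): ForkedSession {
  const forkId = newId();
  const messages = sourceMessages
    .filter((m) => m.sessionId === sourceId && m.createdAt <= cutoffMs)
    .sort((a, b) => a.createdAt - b.createdAt)
    // Fresh objects with fresh ids: the fork shares no state with the source.
    .map((m) => ({ ...m, id: newId(), sessionId: forkId }));
  return { id: forkId, forkedFrom: sourceId, state: "idle", messages };
}
```

Because every copied message is a new object, later edits to the source history leave the fork untouched, which is the property the persistence tests assert.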
Session-issue binding (ADR-0034)
Per ADR-0034, a session binds to at most one issue. The agent's working checklist is stored as issue-owned todos, mutated through the root-only kaged.todo tool that operates implicitly on the bound issue.
Binding semantics
- Session-side pointer. bound_issue is a nullable column on the sessions table. It points at an issues.id row. An issue may be worked by many sessions over its lifetime, one active at a time.
- Binding is not a state transition. Binding and unbinding from the UI/sidebar attaches or detaches the pointer; it does not mutate, clear, or complete stored todos. Unbinding simply puts the existing (possibly incomplete) todos out of reach of the current session. Re-binding the same issue rehydrates them.
- kaged.todo requires a binding. With no binding, the tool resolves to an error that tells the agent to ask the operator to bind an issue. This is deliberate: the issue is the storage, so no issue means no todos.
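
The pointer semantics above can be sketched with todos stored per issue and the session holding only a nullable reference. All names here (Todo, readTodos, bind, unbind, the Map-based store) are hypothetical and exist only to illustrate the behaviour.

```typescript
interface Todo {
  text: string;
  done: boolean;
}

interface Session {
  id: string;
  boundIssue: string | null;
}

type TodoResult =
  | { ok: true; todos: Todo[] }
  | { ok: false; error: string };

// Todos live with the issue, keyed by issue id; the session only holds a pointer.
function readTodos(session: Session, todosByIssue: Map<string, Todo[]>): TodoResult {
  if (session.boundIssue === null) {
    return { ok: false, error: "No issue is bound to this session. Ask the operator to bind one." };
  }
  return { ok: true, todos: todosByIssue.get(session.boundIssue) ?? [] };
}

// Binding and unbinding move the pointer only; stored todos are never touched.
function bind(session: Session, issueId: string): void {
  session.boundIssue = issueId;
}

function unbind(session: Session): void {
  session.boundIssue = null;
}
```

Unbinding and then re-binding the same issue returns the same todo list, because nothing in the store changed in between.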
Bind/unbind endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/sessions/:id/bind | PUT | Bind an issue to the session. Body: { "issue_id": "<ulid>" }. Validates the issue exists. |
| /api/v1/sessions/:id/bind | DELETE | Unbind the issue from the session. Sets bound_issue to null. |
Session creation with binding
POST /api/v1/projects/:id/sessions accepts an optional bound_issue field in the body. When present, the session is created with the binding already set. This is the "send to agent" path from the issue detail screen.
Auto-conclude answer
Per ADR-0034, ending a session never silently resolves its bound issue. The operator confirms closure at the closing checkpoint. When the agent has marked the last open criterion done, it calls kaged.checkpoint with a reason like "acceptance criteria met, requesting sign-off." The session pauses. The operator reviews, then resumes; the resume path transitions the issue to resolved (via kaged.issue transition or the UI) and the session may conclude.
This resolves the standing open question: manual session termination does not auto-resolve an assigned issue. The operator confirms closure at the closing checkpoint.
Concurrency
Session-level concurrency
| Resource | Limit | Enforced when | Enforced by |
|---|---|---|---|
| Max concurrent running sessions per project | 4 | Message send (post_message) | Session manager |
| Max concurrent running sessions per operator (across projects) | 16 | Message send (post_message) | Session manager |
| Max concurrent runs per session | 1 | Message send (post_message) | Session manager (v0; sequential runs only) |
| Max concurrent subagents per run | 8 | Subagent dispatch | Subagent supervisor (per daemon.md) |
| Max concurrent PTYs per session | 8 | PTY allocation | PTY broker |
Concurrency is checked at message-send time, not session-creation time. Session creation always succeeds (as long as the project is ready). The running-session count is checked when the operator posts a message:
- Under the limit: the run starts immediately. Session transitions idle → running.
- At the limit: the message is persisted, a run is created in pending, and the session transitions idle → queued. The operator sees a pause indicator and can resume when a slot frees.
No auto-dequeue. When a running session ends and frees a slot, the daemon broadcasts the updated count via the system WebSocket (sessions.running_count) but does not automatically start queued sessions. The operator explicitly resumes each queued session via POST /api/v1/sessions/:id/resume. This is deliberate: the operator decides which queued session gets the freed slot, not the daemon.
What counts as "running": only sessions in the running state. Sessions in idle, queued, paused, ended, or failed do not count against the concurrency limit. A paused session has suspended its run (checkpoint) and released the slot; a queued session never started its run.
Exceeding a per-session limit (one run per session — posting while running) returns 409 conflict. Exceeding a cross-session concurrency limit (per-project or per-operator) does not return an error — the message is queued.
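
The throttle decision described in this section can be sketched as two pure functions over the current set of sessions. This is an illustration under assumptions: the SessionView shape, the function names, and the order in which the two limits are checked (per-project first) are not specified by this document, and the resume function only models resume from queued.

```typescript
type SessionState = "idle" | "queued" | "running" | "paused" | "ended" | "failed";

// Hypothetical view: only the fields the throttle needs.
interface SessionView {
  id: string;
  projectId: string;
  operatorId: string;
  state: SessionState;
}

// Limits from the session-level concurrency table.
const MAX_RUNNING_PER_PROJECT = 4;
const MAX_RUNNING_PER_OPERATOR = 16;

type LimitHit = "per_project" | "per_operator" | null;

// Only sessions in the running state count; paused and queued sessions hold no slot.
function limitHit(target: SessionView, all: SessionView[]): LimitHit {
  const running = all.filter((s) => s.state === "running");
  const inProject = running.filter((s) => s.projectId === target.projectId).length;
  if (inProject >= MAX_RUNNING_PER_PROJECT) return "per_project";
  const forOperator = running.filter((s) => s.operatorId === target.operatorId).length;
  if (forOperator >= MAX_RUNNING_PER_OPERATOR) return "per_operator";
  return null;
}

type PostOutcome =
  | { kind: "start" } // idle → running
  | { kind: "queue"; reason: "per_project" | "per_operator" } // idle → queued
  | { kind: "conflict" }; // 409: this session already has a run

// Decision taken at message send. Other non-idle states are not modelled here.
function onPostMessage(target: SessionView, all: SessionView[]): PostOutcome {
  if (target.state === "running") return { kind: "conflict" };
  const hit = limitHit(target, all);
  return hit === null ? { kind: "start" } : { kind: "queue", reason: hit };
}

type ResumeOutcome = { kind: "start" } | { kind: "conflict" };

// Operator-explicit resume: succeeds only if a slot is actually free.
function onResume(target: SessionView, all: SessionView[]): ResumeOutcome {
  if (target.state !== "queued") return { kind: "conflict" };
  return limitHit(target, all) === null ? { kind: "start" } : { kind: "conflict" };
}
```

Keeping the decision separate from the side effects (persisting the message, creating the pending run, broadcasting the count) makes the rule easy to unit-test; the real implementation would run it while holding the session lock.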
Locking
- Session lock: each session has an in-memory mutex. All state transitions acquire the lock. This prevents races between "operator cancels run" and "run completes naturally" arriving simultaneously.
- No cross-session locks. Sessions are independent. A deadlock between sessions is architecturally impossible.
- Database writes are serialized by SQLite's write lock (or Postgres' transaction isolation). The session manager does not add its own database-level locking beyond the engine's guarantees.
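
A per-session in-memory mutex of the kind described above can be sketched as a promise chain. This is one possible shape, not the implementation; SessionLock, SessionRun, and finish are hypothetical names. The example shows the cancel-versus-complete race resolving to exactly one winner.

```typescript
// Minimal promise-chain mutex: callers run one at a time, in arrival order.
class SessionLock {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    // Keep the chain usable even if fn rejects.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

type RunState = "running" | "done" | "cancelled";

class SessionRun {
  private readonly lock = new SessionLock();
  runState: RunState = "running";

  // Both paths re-check the state under the lock, so only the first applies.
  finish(to: "done" | "cancelled"): Promise<boolean> {
    return this.lock.runExclusive(async () => {
      if (this.runState !== "running") return false; // lost the race
      this.runState = to;
      return true;
    });
  }
}
```

If "operator cancels run" and "run completes naturally" arrive together, whichever acquires the lock first sets the final state and the other observes a no-op, which the caller can surface as a conflict.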
DSL hot-reload interaction
Per daemon.md:
- DSL changes are applied at next session-start, not to active sessions.
- Active sessions continue with the DSL they were started under.
- The UI shows a "DSL changed; new sessions will use the updated config" indicator.
- The operator can use "restart with new DSL" in the session UI, which ends the current session and creates a new one with the updated DSL.
The session manager stores the DSL version (a hash of the project.yaml content) at session creation time. This is used to:
- Detect stale sessions (DSL changed since session started).
- Replay/audit: know which DSL was in effect during a session.
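
As an illustration of the DSL version check, the sketch below hashes the project.yaml content and compares it with the value stored at session creation. SHA-256 is an assumption (this document only says "a hash"), and dslVersion, SessionDsl, and isStale are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Assumption: SHA-256 over the UTF-8 bytes of project.yaml, hex-encoded.
function dslVersion(projectYaml: string): string {
  return createHash("sha256").update(projectYaml, "utf8").digest("hex");
}

// Hypothetical slice of the session record: the version captured at creation.
interface SessionDsl {
  id: string;
  dslVersion: string;
}

// A session is stale when the DSL on disk no longer matches what it started under.
function isStale(session: SessionDsl, currentYaml: string): boolean {
  return session.dslVersion !== dslVersion(currentYaml);
}
```

Hashing the raw content means whitespace-only edits also mark sessions as stale; hashing a normalized form would avoid that at the cost of extra parsing.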
Audit events
| Event | When | Data |
|---|---|---|
| session.created | New session | session_id, project_id, user_id, forked_from |
| session.queued | Message accepted but throttled by concurrency limit | session_id, run_id, message_id, reason (per_project or per_operator), running_count, limit |
| session.resumed_from_queue | Operator resumed a queued session | session_id, run_id, user_id, running_count |
| session.queued_discarded | Operator discarded a queued message | session_id, run_id, message_id, user_id |
| session.state | State transition | session_id, from_state, to_state, trigger |
| session.attached | Operator connected via WebSocket | session_id, user_id, device_hint |
| session.detached | Operator disconnected | session_id, user_id, reason (clean/timeout/error) |
| session.ended | Session ended | session_id, user_id, run_count, duration |
| session.crash_recovered | Daemon restarted; session was in running | session_id, failed_run_id |
| run.created | New run started | run_id, session_id, message_preview |
| run.completed | Run finished | run_id, state, duration_ms, tokens |
| run.cancelled | Operator cancelled a run | run_id, session_id, user_id |
| checkpoint.created | Checkpoint taken | checkpoint_id, session_id, created_by, reason |
| checkpoint.resumed | Checkpoint resumed | checkpoint_id, session_id, prompts_edited |
| checkpoint.rolled_back | Rollback to checkpoint | checkpoint_id, session_id, messages_superseded |
| prompt.edited | Prompt changed at checkpoint | checkpoint_id, target, old_hash, new_hash |
| pty.allocated | PTY created for a subagent | pty_id, session_id, subagent_name |
| pty.closed | PTY closed | pty_id, reason, transcript_lines |
| pty.reattached | PTY reconnected after disconnect | pty_id, scrollback_lines_sent |
Failure modes
| Failure | Detection | Recovery | Operator impact |
|---|---|---|---|
| LLM provider unreachable during run | HTTP timeout / connection refused | Run marked failed. Session → idle. Operator retries. | Message "Provider unreachable. Try again or check provider config." |
| Subagent cage fails to spawn | Sandbox compiler error or bwrap exec failure | Run marked failed with cage error details. Session → idle. | Error details shown; operator checks DSL cage config. |
| Subagent exceeds walltime | Cage walltime watchdog | Subagent killed. Run continues with partial output if other subagents are active; otherwise run marked failed. | Warning: "Subagent X exceeded walltime limit." |
| Subagent OOM | cgroup OOM killer | Same as walltime. | Warning: "Subagent X exceeded memory limit." |
| Database write failure | SQLite/Postgres error | Session transitions to failed. Daemon logs critical error. | Session lost. Operator creates a new session. |
| Daemon SIGKILL during run | Next startup scan | Run marked failed. Session → idle. | "Previous run was interrupted by a daemon restart." |
| WebSocket disconnect mid-run | Ping/pong timeout | Work continues. Frames buffered. Operator reattaches. | Seamless if reattach within buffer window; full re-fetch otherwise. |
| Checkpoint resume with edited prompts fails | Primary rejects the prompt | Run starts but immediately fails. Session → idle. | "Prompt edit caused an error. Review the edited prompt." |
| Fork source session not found | Database lookup miss | Fork request returns 404. | "Source session not found." |
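
The "Daemon SIGKILL during run" row can be illustrated with a sketch of the startup scan. The row shapes, the run state names other than pending, failed, and cancelled, and the JSON written to the error column are assumptions for this example; the event fields follow the session.crash_recovered audit event.

```typescript
type SessionState = "idle" | "queued" | "running" | "paused" | "ended" | "failed";
type RunState = "pending" | "running" | "done" | "failed" | "cancelled";

interface SessionRow {
  id: string;
  state: SessionState;
}

interface RunRow {
  id: string;
  sessionId: string;
  state: RunState;
  error: string | null; // JSON text
}

interface CrashRecovered {
  event: "session.crash_recovered";
  session_id: string;
  failed_run_id: string;
}

// Startup scan: any session still marked running was interrupted by the restart.
function recoverAfterRestart(sessions: SessionRow[], runs: RunRow[]): CrashRecovered[] {
  const events: CrashRecovered[] = [];
  for (const session of sessions) {
    if (session.state !== "running") continue; // queued, paused, etc. are left alone
    const run = runs.find((r) => r.sessionId === session.id && r.state === "running");
    session.state = "idle";
    if (run === undefined) continue;
    run.state = "failed";
    // Hypothetical error payload; the real RunError shape is defined elsewhere.
    run.error = JSON.stringify({ message: "Previous run was interrupted by a daemon restart." });
    events.push({ event: "session.crash_recovered", session_id: session.id, failed_run_id: run.id });
  }
  return events;
}
```

In a real daemon the two updates and the audit event would be written in one database transaction so that a second crash during recovery cannot leave a session idle with a run still marked running.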
Testing notes
State machine tests
- Every transition: assert the guard condition, the side effects, and the resulting state.
- Invalid transitions: assert rejection (e.g., ended → running returns error).
- Concurrent transitions: two threads try to transition the same session simultaneously. Assert one wins, one gets a conflict.
- Crash recovery: simulate daemon kill during running. Assert next startup marks run as failed and session as idle.
Concurrency throttle tests
- Under limit: post message with running count below limit. Assert session → running, run starts.
- At per-project limit: create 4 running sessions in a project, post a message on a 5th. Assert session → queued, run is pending, message is persisted.
- At per-operator limit: create 16 running sessions across projects, post a message on a 17th. Assert session → queued.
- Resume from queued: session is queued, slot frees, operator calls resume. Assert session → running, pending run starts.
- Resume when still at limit: session is queued, no slot freed. Operator calls resume. Assert 409 conflict (slot not available).
- Discard queued message: session is queued, operator discards. Assert session → idle, run marked cancelled, message marked superseded.
- Paused session frees slot: session A is running, session B is queued. Checkpoint session A. Assert session A → paused, running count decrements. Operator can now resume session B.
- Session creation ignores concurrency: create 5 sessions in a project (limit is 4). Assert all 5 succeed with idle state. Concurrency is not checked at creation.
- System count broadcast: start/end a run. Assert sessions.running_count event fires on the system socket with correct counts.
PTY tests
- Allocation: request a PTY for a subagent. Assert the PTY is created and I/O flows.
- Resize: send resize event. Assert SIGWINCH reaches the process.
- Detach/reattach: disconnect the WebSocket. Assert PTY stays alive. Reconnect. Assert scrollback is delivered.
- Transcript persistence: close a PTY. Assert the transcript is in the database. Fetch via API. Assert content matches.
- Rate limiting: flood the PTY output channel. Assert frames are dropped with a warning and the daemon does not run out of memory.
Checkpoint tests
- Create (operator): request checkpoint during a run. Assert session pauses, subagents suspended.
- Create (model): mock a primary that calls the checkpoint tool. Assert same behavior.
- Resume: resume a checkpoint. Assert subagents get SIGCONT. Assert session → running.
- Resume with prompt edit: edit a prompt at a checkpoint, resume. Assert the primary sees the edit notification.
- Rollback: roll back to a checkpoint. Assert later messages are marked superseded. Assert next run starts from the checkpoint state.
- Rollback to old checkpoint: roll back to a checkpoint that is not the most recent. Assert intermediate checkpoints are superseded.
Persistence tests
- Message ordering: post 100 messages. Assert they come back in order.
- Run recording: complete a run. Assert all fields are persisted.
- Session survives restart: create a session, restart the daemon. Assert session is still there with correct state.
- Fork: fork a session. Assert message history is copied. Modify original. Assert fork is unaffected.
Reattach tests
- Clean reattach within buffer window: disconnect, reconnect within 10 minutes. Assert missed frames are replayed.
- Reattach after buffer overflow: disconnect, wait for buffer to overflow. Assert resume_failed and client falls back to HTTP re-fetch.
- Multi-device conflict: connect from device A, try to connect from device B. Assert 409.
Open questions
- Parallel runs. v0 is one run per session. Should v0.x allow queuing multiple messages (the primary processes them sequentially) or true parallel runs (multiple primaries)? Sequential queuing seems low-cost; true parallel is a big complexity bump.
- Session sharing between operators. v0 is single-operator sessions. v1 may want shared sessions (one operator watches another's session in read-only). Needs auth model extension.
- Session archival. Old sessions accumulate. When do they get archived? v0: never automatically. Operator uses DELETE to end them. A future kaged session archive --older-than 30d is plausible.
- Subagent suspension fidelity. SIGTSTP to a caged process works for simple commands but may not work for all programs (some ignore TSTP, some corrupt state). v0 documents the limitation; v0.x may add a "snapshot and kill" checkpoint mode for subagents that can't be suspended.
- PTY replay fidelity. Timed replay of PTY transcripts is useful for debugging but the timing markers add storage overhead. v0 stores timestamps per chunk (every ~100ms batch); finer granularity deferred.
- Session memory limit. A session with thousands of messages may exhaust the context window of the primary model. The session manager does not manage context windows (that's the primary's job), but should it enforce a message-count limit? v0: no limit. The primary is responsible for its own context management.
Amendments
2026-05-23 — Per-session model override
Added model_override column to the sessions table and the corresponding modelOverride field to SessionRecord. When set, it is a "provider:model" string that overrides the DSL's primary.model alias for all runs in this session. The daemon's dispatch path checks session.modelOverride before alias resolution — if set, it splits the override into provider + model, resolves the provider's credentials from local config, and constructs the ProviderRoute directly, bypassing alias lookup entirely.
The override is a session-level persistent setting, not per-message. It is set via PUT /api/v1/sessions/:id (see http-api.md) and can be cleared by setting it to null. The UI exposes a model picker in the session input area that reads from the operator's configured providers and their persisted model catalogs.
POST /api/v1/sessions/:id/messages also accepts an optional model_override field in the request body. When present, it is persisted to the session record before dispatch, becoming the session's override for this and all subsequent messages (until changed or cleared). This enables per-message model switching while maintaining the "session-level override" semantic — the last-used model is sticky.
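
The override check in the dispatch path can be sketched as follows. This is a simplified illustration: the ProviderRoute shape here omits credentials, splitting at the first colon is an assumption (so that model names containing colons survive), and parseModelOverride and resolveRoute are hypothetical names.

```typescript
// Hypothetical, reduced route shape; the real ProviderRoute also carries credentials.
interface ProviderRoute {
  provider: string;
  model: string;
}

// Split a "provider:model" string at the first colon (assumption).
function parseModelOverride(override: string): ProviderRoute {
  const idx = override.indexOf(":");
  if (idx <= 0 || idx === override.length - 1) {
    throw new Error(`model override must look like "provider:model", got "${override}"`);
  }
  return { provider: override.slice(0, idx), model: override.slice(idx + 1) };
}

// The override wins when set; alias resolution is only consulted when it is null.
function resolveRoute(
  modelOverride: string | null,
  resolveAlias: () => ProviderRoute,
): ProviderRoute {
  return modelOverride !== null ? parseModelOverride(modelOverride) : resolveAlias();
}
```

Passing alias resolution as a callback keeps the bypass explicit: when an override is present, the alias lookup is never invoked.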
2026-05-27 — ADR-0024: compaction-pending checkpoints and CompactionRecord
Per ADR-0024:
- Checkpoint created_by extended. New value "compaction" added alongside "operator" and "model". Distinguished by the harness's compaction pipeline when strategy: checkpoint is configured. New compaction_id field on Checkpoint links to the paired CompactionRecord.
- reason: "compaction_pending" is a distinguished value indicating the checkpoint was created by a strategy: checkpoint compaction event awaiting operator review. The Compactor UI (per ui/compactor.md) surfaces these checkpoints differently.
- New CompactionRecord shape defined. Persisted in the compactions SQLite table. Fields include trigger, strategy, threshold estimates, message effect (superseded IDs, summary message), plugin participation, cost, fallback tracking, operator feedback (operator_flag, operator_notes), and checkpoint_id linkage. Schema migration in @kaged/storage adds the table with indexes on session_id, run_id, agent_path.
- Compaction lifecycle section added documenting how compaction integrates with the session manager: most strategies don't change session state, but strategy: checkpoint creates a paired checkpoint and pauses the session. Approve / reject flows specified.
- Operator feedback persistence. PATCH /api/v1/sessions/:id/compactions/:cid (per http-api.md) writes operator_flag and operator_notes to the CompactionRecord. Used by the Compactor UI's history view.
- Rollback interaction. Rolling back past a compaction un-supersedes the affected superseded_message_ids. The summary message (if any) is superseded by the rollback. The CompactionRecord is preserved with rolled_back: true (a new field).
- Compaction limits added: 1000 CompactionRecord rows per session before aggregation; 60s timeout per compaction event with fallback to drop.
- Constrained-by list extended with ADR-0024.
2026-06-05 — ADR-0034: session-issue binding, bind/unbind transitions, auto-conclude answer
Per ADR-0034:
- bound_issue column added to the sessions table schema. Nullable FK to issues.id. Stored on SessionRecord as boundIssue: string | null.
- Session-issue binding section added. Documents binding semantics: session-side pointer, not a state transition; binding/unbinding does not mutate todos; kaged.todo requires a binding.
- Bind/unbind endpoints documented. PUT /api/v1/sessions/:id/bind and DELETE /api/v1/sessions/:id/bind.
- Session creation with binding. POST /api/v1/projects/:id/sessions accepts optional bound_issue field.
- Auto-conclude answer. Ending a session never silently resolves its bound issue. The operator confirms at the closing checkpoint. Resolves the standing open question.
- Constrained-by list extended with ADR-0034.
2026-06-05 — Concurrency throttle at message-send: queued state, system-wide running count, operator-explicit resume
The concurrency model is restructured: limits are checked at message-send time, not session-creation time, and a message that would exceed a limit is queued rather than rejected.
- New session state: queued. When the operator posts a message and the running-session count (per-project or per-operator) is at the limit, the session enters queued instead of rejecting the message. The message is persisted, a run is created in pending, and no primary dispatch occurs.
- Concurrency limits moved from creation to message-send. checkSessionCreationLimits is removed from session creation. Session creation always succeeds (project must be ready). The per-project (4) and per-operator (16) limits are checked in post_message instead.
- Operator-explicit resume. The daemon does not auto-dequeue queued sessions when slots free. The operator calls POST /api/v1/sessions/:id/resume to start the pending run. This gives the operator control over which queued session gets a freed slot.
- Queued message discard. The operator can discard a queued message via DELETE /api/v1/sessions/:id/queued-message, which cancels the pending run, marks the message superseded, and transitions the session back to idle.
- System-wide running count broadcast. Every run start/end broadcasts sessions.running_count on the system WebSocket (per http-api.md § System WebSocket). The count includes per-project and per-operator breakdowns so the UI can show both the global count and whether a specific session is near its limit.
- paused sessions release the slot. A checkpoint-paused session transitions running → paused and no longer counts against the concurrency limit. This is consistent with the existing checkpoint semantics (work is suspended) and means pausing a session can free a slot for a queued one.
- State machine diagram, states table, transitions table, transition side effects, and audit events updated. Three new audit events: session.queued, session.resumed_from_queue, session.queued_discarded.
- "What counts as running" clarified. Only running sessions count against the limit. idle, queued, paused, ended, and failed do not.
2026-06-05: Sticky todo reminder injection in primary runner
- The primary runner's message pipeline now includes a sticky todo reminder injection point. After spend-limit enforcement and before the LLM call, the runner resolves the session's bound_issue, loads its todos, and conditionally appends an ephemeral system-role message listing open items. The message is not persisted; it exists only for that single LLM call. Suppression logic prevents redundancy when a kaged.todo tool result is already in the preceding turn. See agent-tooling.md § kaged.todo for the full specification.
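
A rough sketch of the injection step, under stated assumptions: the ChatMessage shape, the reminder wording, and the reading of "preceding turn" as "the last message in the list" are all hypothetical, and agent-tooling.md remains the authoritative specification.

```typescript
interface Todo {
  text: string;
  done: boolean;
}

// Hypothetical message shape for the list sent to the LLM.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  toolName?: string;
}

// Returns the list to send for this one call; the input array is never mutated,
// so the reminder is not persisted with the session's messages.
function withTodoReminder(messages: ChatMessage[], todos: Todo[] | null): ChatMessage[] {
  if (todos === null) return messages; // no bound issue
  const open = todos.filter((t) => !t.done);
  if (open.length === 0) return messages;
  const last = messages[messages.length - 1];
  // Suppression (assumed rule): a fresh kaged.todo result already shows the list.
  if (last !== undefined && last.role === "tool" && last.toolName === "kaged.todo") {
    return messages;
  }
  const content = ["Open todos:", ...open.map((t) => `- ${t.text}`)].join("\n");
  return [...messages, { role: "system", content }];
}
```

Returning a new array rather than pushing onto the stored history is what keeps the reminder ephemeral.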
References
- ADR-0002: web-first, sessions survive disconnects
- ADR-0004: Bun runtime, PTY integration
- ADR-0005: storage engine, schema portability
- ADR-0007: identity threading through sessions
- ADR-0009: cage mechanism for subagents within sessions
- ADR-0010: deployment modes, same session semantics
- ADR-0024: context compaction; CompactionRecord and compaction_pending checkpoint
- agent.md § Compaction: the harness-side compaction pipeline this section integrates with
- http-api.md: the API surface the session manager implements
- daemon.md: process model, supervisor, operating limits
- sandbox.md: cage spawn/teardown called by the session manager
- plugin-host.md: plugins called during runs
- project-dsl.md: DSL that defines session shape (subagents, cages, prompts)
- local-config.md: alias resolution at session-start