Spec: Daemon

Purpose

This spec defines the kaged daemon as a process: how it starts, how it's configured, where it lives on disk, what it runs in what order, how it shuts down, and how the operator interacts with it via the CLI.

This document is normative for:

  • The daemon's process model — single binary, supervised children, no internal forking.
  • The configuration sources and precedence (env, file, flags).
  • The filesystem layout under KAGED_HOME.
  • The startup self-check sequence, including the security gates from ADR-0007 and ADR-0009.
  • The CLI surface (kaged ...).
  • Logging streams (operational vs audit) and where they go.
  • Crash and restart semantics for the daemon, subagents, and plugins.
  • The systemd integration shape that v0 ships with.

It is not normative for:

This spec is about the runtime container that hosts all of the above.

Constraints (from ADRs)

Constraint | Source
--- | ---
Daemon is the lifecycle root — only the init system is its parent | ADR-0001
HTTP+WS server is the principal surface; web UI is bundled and served by the daemon | ADR-0002
Runtime is Bun; single-binary deploys via bun build --compile | ADR-0004
Default storage is SQLite at a file path; Postgres opt-in via URL | ADR-0005
Default bind is loopback; sidecar required unless --insecure | ADR-0007 and its amendment
Plugins are subprocess children supervised by the daemon | ADR-0008
Sandbox is on by default; --no-sandbox is a daemon-level opt-out | ADR-0009 and its amendment
Two deployment modes: per-user and system-wide, both first-class | ADR-0010
Projects are portable; operator-local concerns live in local config | ADR-0011

Deployment mode

The daemon runs in one of two modes, picked at startup (per ADR-0010). The mode determines paths, default auth, and the default systemd unit shape — but not behavior.

Mode detection

At the start of phase 1 (bootstrapping), the daemon resolves its mode in this order:

  1. Explicit override. --mode=user or --mode=system CLI flag, or KAGED_MODE=user|system. If set, that's the mode. (Used by tests; rarely set by operators.)
  2. KAGED_HOME set explicitly. The mode is inferred from the path:
    • Inside /var/lib, /opt, or any path the operator's UID doesn't own → system.
    • Inside $HOME or $XDG_DATA_HOME → user.
  3. UID and ownership check. Daemon running as a dedicated kaged system user or as UID 0 → system. Daemon running as a regular user → user.
  4. XDG path probe. If $XDG_DATA_HOME/kaged or ~/.local/share/kaged exists → user. If /var/lib/kaged exists and is readable by the daemon user → system.
  5. Default fallback: user. The friendliest default for a fresh install.

The resolved mode is printed at startup: kaged 0.1.0 starting | mode=user | bind=127.0.0.1:38291 | ....
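
The resolution order above can be sketched as a pure function. This is an illustrative sketch, not the daemon's implementation: the ModeInputs shape, the injected probes (ownsPath, exists, readable), and the optional systemUser field are all hypothetical. Step 3 is modelled as optional (unknown when the UID cannot be classified) so that steps 4 and 5 stay reachable; that is an assumption of this sketch, as is falling through when an explicit KAGED_HOME matches neither rule in step 2.

```typescript
type Mode = "user" | "system";

// Hypothetical input shape; every filesystem probe is injected so the function stays pure.
interface ModeInputs {
  override?: Mode;               // --mode flag or KAGED_MODE
  kagedHome?: string;            // explicit KAGED_HOME, if set
  homeDir: string;               // $HOME
  xdgDataHome?: string;          // $XDG_DATA_HOME
  systemUser?: boolean;          // true: kaged user or UID 0; false: regular user; undefined: unknown
  ownsPath: (p: string) => boolean;
  exists: (p: string) => boolean;
  readable: (p: string) => boolean;
}

const isInside = (path: string, dir: string): boolean =>
  path === dir || path.startsWith(dir.endsWith("/") ? dir : dir + "/");

function resolveMode(i: ModeInputs): Mode {
  // 1. Explicit override.
  if (i.override) return i.override;
  // 2. Infer from an explicit KAGED_HOME.
  if (i.kagedHome) {
    const h = i.kagedHome;
    if (isInside(h, "/var/lib") || isInside(h, "/opt") || !i.ownsPath(h)) return "system";
    if (isInside(h, i.homeDir) || (i.xdgDataHome !== undefined && isInside(h, i.xdgDataHome))) return "user";
  }
  // 3. UID and ownership check.
  if (i.systemUser !== undefined) return i.systemUser ? "system" : "user";
  // 4. XDG path probe.
  if ((i.xdgDataHome !== undefined && i.exists(`${i.xdgDataHome}/kaged`)) ||
      i.exists(`${i.homeDir}/.local/share/kaged`)) return "user";
  if (i.exists("/var/lib/kaged") && i.readable("/var/lib/kaged")) return "system";
  // 5. Default fallback.
  return "user";
}
```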

Mode-determined defaults

Default | user mode | system mode
--- | --- | ---
${KAGED_HOME} | $XDG_DATA_HOME/kaged (default ~/.local/share/kaged) | /var/lib/kaged
Operational config | $XDG_CONFIG_HOME/kaged/config.toml (default ~/.config/kaged/config.toml) | /etc/kaged/config.toml
Local config (per local-config.md) | $XDG_CONFIG_HOME/kaged/local.toml | ${KAGED_HOME}/local/<user>.toml (one file per operator)
Bind | 127.0.0.1:<random-free-port> (or fixed in config) | 127.0.0.1:7777
Auth mode | loopback (cookie-bound nonce; see ADR-0007 amendment) | sidecar (header contract with oauth2-proxy or equivalent)
systemd unit | ~/.config/systemd/user/kaged.service | /etc/systemd/system/kaged.service
Plugin store | ${KAGED_HOME}/plugins/ (in this operator's home) | ${KAGED_HOME}/plugins/ (shared across all operators of this daemon)
Project registry | per-operator in local config | per-operator in local config (each operator has their own list)

The defaults are recommendations; every value is overridable in config or by env var. Mode just picks the starting point.
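
As a rough illustration of how the mode picks a starting point, the sketch below derives a few of the defaults from the table. The names modeDefaults, ModeDefaults, and Env are hypothetical, and port 0 is used here only as a stand-in for "<random-free-port>"; the daemon's actual port selection is not specified in this document.

```typescript
type DeploymentMode = "user" | "system";

// Hypothetical subset of the process environment.
interface Env { HOME: string; XDG_DATA_HOME?: string; XDG_CONFIG_HOME?: string }

interface ModeDefaults {
  home: string;                      // ${KAGED_HOME}
  configFile: string;                // operational config path
  bind: string;                      // listen address
  authMode: "loopback" | "sidecar";  // default auth mode
}

function modeDefaults(mode: DeploymentMode, env: Env): ModeDefaults {
  if (mode === "system") {
    return {
      home: "/var/lib/kaged",
      configFile: "/etc/kaged/config.toml",
      bind: "127.0.0.1:7777",
      authMode: "sidecar",
    };
  }
  // User mode follows XDG, falling back to the conventional directories under $HOME.
  const dataHome = env.XDG_DATA_HOME || `${env.HOME}/.local/share`;
  const configHome = env.XDG_CONFIG_HOME || `${env.HOME}/.config`;
  return {
    home: `${dataHome}/kaged`,
    configFile: `${configHome}/kaged/config.toml`,
    bind: "127.0.0.1:0", // assumption: 0 stands in for a random free port
    authMode: "loopback",
  };
}
```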

What's identical across modes

The mode is a deployment-shape concern, not a behavior concern. Project authors and plugin authors do not need to know which mode their work will be loaded into.


Process model

The kaged daemon is a single long-lived process.

init system (systemd / launchd)
  └── kaged daemon (one process)
        ├── plugin: oh-my-pi    (supervised subprocess)
        ├── plugin: ollama      (supervised subprocess)
        ├── subagent: scraper   (supervised, in cage)
        ├── subagent: writer    (supervised, in cage)
        └── network gatekeeper  (in-process; not a separate child)

Rules:

  • The daemon does not fork itself. No worker pool, no master+worker, no preforked workers. Bun's concurrency is async; the daemon is single-process.
  • Every long-lived child is supervised. Plugins and subagents are subprocesses managed by named supervisors (PluginSupervisor, SubagentSupervisor) inside the daemon. If a child exits, the supervisor records it and (per restart policy) may respawn.
  • The daemon never execs into another binary. Replacements (upgrades) happen by stopping and starting a fresh process via the init system.
  • No daemon-internal IPC besides JSON-RPC-over-stdio with plugins. The daemon does not open Unix sockets to itself, does not run an internal HTTP loopback, does not use shared memory.

The "long-lived parent" promise from ADR-0001 is mechanical: no other user-space process supervises the daemon; the init system is its only parent.


Configuration

Sources, in precedence order

  1. CLI flags (highest precedence)
  2. Environment variables (KAGED_*)
  3. Config file (at the mode-appropriate default path; see Configuration file)
  4. Built-in defaults (lowest)

A value at a higher tier silently shadows lower tiers — the daemon does not error on overlap. The effective config is reported by kaged config show for inspection.
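
A minimal sketch of the layering, assuming each source has already been flattened to dotted config paths. The mergeConfig function and the Layer and Resolved types are hypothetical; the source annotation mirrors what kaged config show --source reports.

```typescript
type Source = "flag" | "env" | "file" | "default";
type Layer = Record<string, string | undefined>;

interface Resolved { value: string; source: Source }

function mergeConfig(layers: Record<Source, Layer>): Record<string, Resolved> {
  const order: Source[] = ["flag", "env", "file", "default"]; // highest precedence first
  const out: Record<string, Resolved> = {};
  for (const source of order) {
    for (const [key, value] of Object.entries(layers[source])) {
      // First writer wins: a higher tier silently shadows lower tiers.
      if (value !== undefined && !(key in out)) out[key] = { value, source };
    }
  }
  return out;
}

// Usage: the flag wins for daemon.bind, env wins for logging.level.
const effective = mergeConfig({
  flag: { "daemon.bind": "127.0.0.1:9000" },
  env: { "daemon.bind": "127.0.0.1:8000", "logging.level": "debug" },
  file: { "logging.level": "warn", "sandbox.mode": "enabled" },
  default: { "daemon.bind": "127.0.0.1:7777", "logging.level": "info", "sandbox.mode": "enabled" },
});
```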

Configuration file

The config file lives at the mode-appropriate default path ($XDG_CONFIG_HOME/kaged/config.toml for user mode, /etc/kaged/config.toml for system mode). The format is TOML, not YAML: the DSL is YAML, and using a different format here avoids confusion about which file the operator is editing. TOML is also closer to the "configuration not data" feel of this file.

Auto-creation on first run. If no config file exists at the default path (and no --config flag or KAGED_CONFIG env is set), the daemon creates one with mode-appropriate defaults and logs Config created: <path>. This ensures the operator always has a config file to inspect and edit — no silent defaults.

# /etc/kaged/config.toml (system-mode example)
# Daemon configuration. Reloaded only at restart.
# Generated by kaged on first run with mode-appropriate defaults.

[daemon]
bind = "127.0.0.1:7777"           # listen address
home = "/var/lib/kaged"            # may also be set via KAGED_HOME

[auth]
mode = "secure"                    # "secure" | "insecure"
nonce_file = "/var/lib/kaged/auth-nonce"   # secure mode only

[storage]
url = "sqlite:///var/lib/kaged/kaged.db"   # or "postgres://user@host/db"

[sandbox]
mode = "enabled"                   # "enabled" | "disabled"
default_seccomp = "default"        # see ADR-0009

[logging]
operational = "stderr"             # "stderr" | "file:/path" | "journald"
audit = "file:/var/lib/kaged/audit.log"   # audit log is always file-backed
level = "info"                     # "debug" | "info" | "warn" | "error"

[plugins]
dir = "/var/lib/kaged/plugins"
enabled = ["oh-my-pi", "ollama"]   # subset of installed plugins

[ui]
serve = true                       # set false to disable the UI bundle
url = ""                           # base URL of the UI (for launch URLs); see below

The example above shows system-mode defaults. In user mode, auto-generated configs use ${KAGED_HOME}-relative paths (e.g., ~/.local/share/kaged/kaged.db, ~/.local/share/kaged/audit.log, ~/.local/share/kaged/plugins). Path fields left empty in the config file (or absent entirely) are filled at startup relative to the resolved daemon.home — they never fall back to hardcoded /var/lib/kaged paths.
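
The fill-in rule can be sketched as follows. This is a sketch under assumptions: fillPathDefaults and PathFields are hypothetical names, and the three file names are taken from the filesystem layout in this spec.

```typescript
// Hypothetical view of the three path-valued fields the daemon fills at startup.
interface PathFields { storageUrl?: string; auditLog?: string; pluginsDir?: string }

function fillPathDefaults(cfg: PathFields, home: string): Required<PathFields> {
  const h = home.replace(/\/+$/, ""); // drop trailing slashes from daemon.home
  return {
    // Empty or absent fields resolve relative to the resolved daemon.home,
    // never to a hardcoded /var/lib/kaged path.
    storageUrl: cfg.storageUrl || `sqlite://${h}/kaged.db`,
    auditLog: cfg.auditLog || `file:${h}/audit.log`,
    pluginsDir: cfg.pluginsDir || `${h}/plugins`,
  };
}
```

With home set to /var/lib/kaged this reproduces the system-mode values in the example file; with a user-mode home the same fields land under ~/.local/share/kaged.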

ui.url — The base URL where the web UI is reachable. Used to construct the launch URL printed at startup in loopback mode. When the UI runs on a separate process (e.g., a dev server or a tunnel), set this to the UI's origin (e.g., http://127.0.0.1:13001 or https://foo.bar.com). When empty (default), the daemon uses its own bind address — appropriate when the daemon serves the UI bundle itself (ui.serve = true).

The file is parsed at startup. There is no hot-reload. To change config, edit, then systemctl restart kaged (or equivalent).

Environment variables

Every config field has a corresponding env var. Convention: KAGED_<SECTION>_<KEY>, upper-snake-case.

Env var | Config path | Example
--- | --- | ---
KAGED_HOME | daemon.home | /var/lib/kaged
KAGED_BIND | daemon.bind | 127.0.0.1:7777
KAGED_AUTH_MODE | auth.mode | secure
KAGED_INSECURE | shorthand for auth.mode=insecure | 1
KAGED_AUTH_NONCE | sidecar nonce direct (overrides nonce_file) | <random>
KAGED_DATABASE_URL | storage.url | sqlite:///path
KAGED_SANDBOX_MODE | sandbox.mode | enabled
KAGED_NO_SANDBOX | shorthand for sandbox.mode=disabled | 1
KAGED_LOG_LEVEL | logging.level | info
KAGED_PLUGINS_DIR | plugins.dir | /var/lib/kaged/plugins
KAGED_UI_URL | ui.url | http://127.0.0.1:13001

Env vars matching KAGED_* that don't correspond to a known config path are logged as a warning at startup but do not error. Typos surface visibly; forward-compat env vars don't crash old daemons.
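
One way to picture the mapping and the unknown-variable warning is the sketch below. The readEnv function and its tables are hypothetical; the entries come from the table above plus the other KAGED_* names this spec mentions. Letting a shorthand set to "1" override the long form is an assumption of this sketch; the spec does not say which wins.

```typescript
// Env vars that map directly onto a config path (from the table in this section).
const ENV_MAP: Record<string, string> = {
  KAGED_HOME: "daemon.home",
  KAGED_BIND: "daemon.bind",
  KAGED_AUTH_MODE: "auth.mode",
  KAGED_DATABASE_URL: "storage.url",
  KAGED_SANDBOX_MODE: "sandbox.mode",
  KAGED_LOG_LEVEL: "logging.level",
  KAGED_PLUGINS_DIR: "plugins.dir",
  KAGED_UI_URL: "ui.url",
};
// Shorthands expand to a config path plus a fixed value when set to "1".
const SHORTHANDS: Record<string, [string, string]> = {
  KAGED_INSECURE: ["auth.mode", "insecure"],
  KAGED_NO_SANDBOX: ["sandbox.mode", "disabled"],
};
// Known variables that are handled elsewhere and must not trigger a warning.
const OTHER_KNOWN = new Set(["KAGED_AUTH_NONCE", "KAGED_MODE", "KAGED_CONFIG", "KAGED_INSECURE_BIND"]);

function readEnv(env: Record<string, string | undefined>) {
  const values: Record<string, string> = {};
  const warnings: string[] = [];
  for (const [name, raw] of Object.entries(env)) {
    if (!name.startsWith("KAGED_") || raw === undefined) continue;
    const path = ENV_MAP[name];
    if (path !== undefined) values[path] = raw;
    else if (!(name in SHORTHANDS) && !OTHER_KNOWN.has(name)) {
      warnings.push(`unknown env var ${name}`); // warn, never error
    }
  }
  for (const [name, [path, value]] of Object.entries(SHORTHANDS)) {
    if (env[name] === "1") values[path] = value; // assumption: shorthand wins
  }
  return { values, warnings };
}
```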

CLI flags

CLI flags mirror env vars and take the highest precedence. They are documented per command in CLI surface.

Bun's .env loading

ADR-0004 notes Bun auto-loads .env. For the daemon, this means a .env file in the working directory at startup is read into the process environment before the config layering above runs. This is convenient for development; the production deployment uses systemd EnvironmentFile= instead.

Operational config vs local config

The daemon has two config files with distinct purposes:

File | Owns | Scope
--- | --- | ---
config.toml (operational) | Bind address, storage URL, sandbox mode, log destinations, plugin directory | The daemon as a process
local.toml (local config) | Model aliases, provider credentials, project registry, operator preferences | The operator

This section is about config.toml. Local config has its own spec: local-config.md.

Loading semantics differ. config.toml is read once at startup and frozen for the daemon's lifetime (changes require restart). Local config is read per request, per operator, cached in memory for the active sessions of that operator, and flushed on SIGHUP. In a per-user deployment they collapse to "this one operator's two files." In a system-wide deployment, every operator has their own local.toml while the daemon shares one config.toml.


Filesystem layout

The same layout applies in both modes; only ${KAGED_HOME}'s default path differs (see Deployment mode).

${KAGED_HOME}/
├── kaged.db                       # SQLite database (default storage)
├── kaged.db-wal                   # SQLite WAL file
├── kaged.db-shm                   # SQLite shared-memory file
├── audit.log                      # audit log (append-only, rotates)
├── local/                         # system-mode only: per-operator local configs
│   ├── operator.toml              #   one file per operator who has used this daemon
│   └── bob.toml                   #   in user-mode, local config lives at $XDG_CONFIG_HOME/kaged/local.toml
├── plugins/                       # local plugin store (installed plugins)
│   ├── oh-my-pi/
│   │   ├── kaged-plugin.yaml
│   │   └── run.sh
│   └── ollama/
│       ├── kaged-plugin.yaml
│       └── main.py
├── runtime/                       # ephemeral runtime state
│   ├── cages/                     # cage scratch dirs (one per live invocation)
│   ├── pids/                      # supervisor PID files
│   └── socks/                     # reserved for future use; empty in v0
└── tmp/                           # daemon-managed scratch; cleaned on start

The operational config (config.toml) and, in user mode, the local config (local.toml) live in $XDG_CONFIG_HOME/kaged/ — NOT inside ${KAGED_HOME}. This separates state (data) from config (operator preferences) per XDG conventions.

System-mode equivalents:

/etc/kaged/                         # config
└── config.toml                     # operational daemon config

/var/lib/kaged/                     # state (= ${KAGED_HOME})
├── kaged.db
├── audit.log
├── auth-nonce                      # sidecar-mode shared secret (mode 0600)
├── launch-url                      # current launch URL (mode 0600); updated on token regeneration
├── local/                          # per-operator local configs (one file per operator)
│   ├── operator.toml
│   └── bob.toml
├── plugins/
├── runtime/
└── tmp/

User-mode equivalents:

~/.config/kaged/                   # = $XDG_CONFIG_HOME/kaged
├── config.toml                    # operational daemon config (optional in user mode)
└── local.toml                     # this operator's local config

~/.local/share/kaged/              # = $XDG_DATA_HOME/kaged = ${KAGED_HOME}
├── kaged.db
├── audit.log
├── plugins/
├── runtime/
└── tmp/

$XDG_RUNTIME_DIR/kaged/            # ephemeral; cleared on logout
├── auth-cookie                    # per-startup nonce for loopback auth (mode 0600)
└── launch-url                     # current launch URL (mode 0600); updated on token regeneration

Projects do NOT live under ${KAGED_HOME}. Per ADR-0011, projects are operator-owned directories anywhere on the operator's filesystem. The daemon tracks which projects this operator has opened via the project registry in local config (local-config.md). Each project directory contains .kaged/project.yaml and any prompts and project-scoped data the project needs.

Rules:

  • ${KAGED_HOME} is daemon-owned state. The operator may inspect, back up, and clean it, but the operator does not author files inside it directly (except the daemon config.toml if that's the chosen location for it).
  • runtime/ is ephemeral. Cleaned by the daemon at startup. Operators should not write to it.
  • auth-nonce (system mode) and auth-cookie (user mode) are mode 0600. The daemon refuses to start if it finds them world-readable (and --insecure is not set). This is a real check, not just convention.
  • launch-url is mode 0600. Written by the daemon at startup and on every token regeneration. Contains the full launch URL. CLI commands (kaged auth open) read this file directly — no API call required. The file is ephemeral; it is deleted on daemon shutdown and cleared on logout (user mode, via $XDG_RUNTIME_DIR).
  • The database can live elsewhere. Setting storage.url to a Postgres URL or a SQLite path outside ${KAGED_HOME} is supported; the layout above is the default.
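
The permission rule for auth-nonce and auth-cookie can be illustrated with a small check on stat results. This is a sketch: checkSecretFile and FileStat are hypothetical names, and the error strings are placeholders rather than the daemon's actual messages. The mode and uid fields match what a POSIX stat call returns.

```typescript
// Minimal subset of a stat result: permission bits plus owner.
interface FileStat { mode: number; uid: number }

// Returns null when the file is acceptable, or a reason string when the daemon should refuse to start.
function checkSecretFile(st: FileStat, daemonUid: number): string | null {
  const perms = st.mode & 0o777;                 // strip file-type bits
  if (perms & 0o004) return "world-readable";    // the hard refusal named in the rules
  if (perms !== 0o600) {
    return `mode is ${perms.toString(8).padStart(4, "0")}, expected 0600`;
  }
  if (st.uid !== daemonUid) return "not owned by the daemon user";
  return null;
}
```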

Lifecycle

The daemon's life is divided into five phases. Each phase has explicit entry conditions, observable signals, and a defined failure mode.

Phase 1 — bootstrapping

From process exec to "config loaded, logger working."

  1. Parse CLI flags.
  2. Load env vars.
  3. Resolve deployment mode (from flags/env only — config file not yet loaded).
  4. Discover the config file at the mode-appropriate default path. If no config file exists, create one at the default path with mode-appropriate defaults and log Config created: <path>.
  5. Load config file. Merge per precedence.
  6. Resolve ${KAGED_HOME} (flags > env > config > defaultHome(mode)).
  7. Fill empty path defaults (storage.url, logging.audit, plugins.dir) relative to ${KAGED_HOME}.
  8. Initialize the operational logger.
  9. Emit daemon.bootstrap event to stderr (and audit log once writable).
  10. Print effective mode to stderr: kaged 0.1.0 starting | auth=secure | sandbox=enabled | bind=127.0.0.1:7777.

Failure mode: any error here goes to stderr and exits non-zero. The daemon has not yet bound a port, has not yet opened the database. Restart is safe.

Phase 2 — self_check

Security and integrity gates before opening anything.

In order:

  1. Auth gate (per ADR-0007 amendment):

    • If auth.mode == "secure":
      • Verify nonce_file exists and is mode 0600 owned by the daemon user. If not: refuse to start with a clear error pointing at the file path and the chmod command.
      • Read the nonce into memory. Never log it. Never persist it back.
    • If auth.mode == "insecure":
      • Emit the multi-line CLI warning block to stderr.
      • Emit audit event auth.insecure_mode with bind address.
      • Do not check the nonce file.
  2. Bind-safety gate:

    • If bind is non-loopback (anything other than 127.0.0.1, ::1, or a Unix socket path) AND auth.mode == "secure" AND KAGED_INSECURE_BIND != "1":
      • Refuse to start. The operator must either bind loopback (and front with the sidecar) or set KAGED_INSECURE_BIND=1 to acknowledge the risk.
    • If bind is non-loopback AND auth.mode == "insecure":
  3. Sandbox gate (per ADR-0009 amendment):

    • If sandbox.mode == "enabled":
      • Check that bwrap is on PATH and is a recent-enough version. If not: refuse to start with a message naming the package to install.
      • Check kernel-version baseline for user namespaces (5.10+). If not: refuse to start.
    • If sandbox.mode == "disabled":
      • Emit the no-sandbox CLI warning block.
      • Emit audit event sandbox.disabled.
      • Skip the bwrap/kernel checks.
  4. Storage gate:

    • For SQLite: ensure the parent directory of the db path exists and is writable. Open in WAL mode. Run pending schema migrations. Refuse to start on migration failure with the migration ID and error.
    • For Postgres: connect, version-check, run migrations. Same refusal semantics.
  5. Plugins gate:

    • Walk plugins.dir. For each installed plugin, validate its kaged-plugin.yaml manifest. Plugins with invalid manifests are logged and disabled for this daemon run (operator sees them in kaged plugin list); they do not block daemon startup.
    • Plugin processes are not started yet — that happens in running.
  6. Filesystem gate:

    • Clean runtime/. Create subdirs if missing.

If every gate passes, transition to running. Otherwise, the daemon exits with a clear error and a non-zero exit code mapped to the gate that failed (auth=10, bind=11, sandbox=12, storage=13, plugins=14, filesystem=15). Exit codes are stable; ops tooling may key off them.
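
The gate sequence and its exit-code mapping can be sketched as a simple ordered runner. The runSelfCheck function and the Check type are hypothetical; the gate order and exit codes are the ones listed above. Each check is assumed to return null on success or an error message on failure.

```typescript
type Gate = "auth" | "bind" | "sandbox" | "storage" | "plugins" | "filesystem";

const GATE_ORDER: Gate[] = ["auth", "bind", "sandbox", "storage", "plugins", "filesystem"];
const EXIT_CODE: Record<Gate, number> = {
  auth: 10, bind: 11, sandbox: 12, storage: 13, plugins: 14, filesystem: 15,
};

// A check returns null on success or an error message on failure.
type Check = () => string | null;

type SelfCheckResult =
  | { ok: true }
  | { ok: false; gate: Gate; code: number; error: string };

function runSelfCheck(checks: Record<Gate, Check>): SelfCheckResult {
  for (const gate of GATE_ORDER) {
    const error = checks[gate]();
    // Stop at the first failing gate; later gates never run.
    if (error !== null) return { ok: false, gate, code: EXIT_CODE[gate], error };
  }
  return { ok: true };
}
```

A caller would print the error and exit with the returned code, so ops tooling can tell a missing bwrap (12) from a failed migration (13).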

Phase 3 — running

The daemon is doing its job.

Entry actions:

  1. Open the HTTP+WS listener on bind.
  2. Write runtime state files (all mode 0600, created in the mode-appropriate runtime directory — $XDG_RUNTIME_DIR/kaged/ in user mode, ${KAGED_HOME}/ in system mode):
    • auth-cookie (user/loopback mode only): the per-startup nonce from which the session cookie is derived. Generated once at daemon start; does not change when launch tokens are regenerated. Per ADR-0007 amendment. The daemon logs Nonce written: <path> to stderr.
    • launch-url (loopback mode only): the current launch URL ({ui_base_url}/launch?token=<token>). Rewritten whenever the launch token is consumed and regenerated. Also printed to stderr at startup and on each regeneration.
    The runtime directory is created with mode 0700 if it does not exist. The daemon refuses to start if the directory exists but is not owned by the daemon user.
  3. Mark /readyz ready (per http-api.md).
  4. Spawn each enabled plugin. Failed spawns log and disable the plugin; do not bring down the daemon.
  5. Re-evaluate every project in the project registry (see Project loading); the daemon does not scan the filesystem for projects. Validate their DSLs; flag invalid ones (visible in API as dsl_status: invalid). Do not auto-start sessions.
  6. Emit audit event daemon.ready.

In this phase:

  • HTTP requests are served.
  • WebSocket connections are accepted.
  • Subagents are spawned on demand by the supervisor.
  • Plugins are running and reachable.
  • The audit log is being written.

Phase 4 — draining

Triggered by SIGTERM. Graceful shutdown.

  1. Mark /readyz not-ready (returns 503). The HTTP listener is still bound — load balancers stop sending new traffic.
  2. Reject new WebSocket upgrades with 503.
  3. Send closing { code: "server_shutdown" } to every connected WebSocket, then close after a 1-second flush window.
  4. Send a "shutdown soon" notice to every live subagent. Wait shutdown_grace_sec (default 10) for them to finish.
  5. SIGTERM any subagents still running. Wait shutdown_kill_sec (default 5).
  6. SIGKILL any survivors.
  7. Send shutdown JSON-RPC notification to every plugin. Wait shutdown_grace_sec for them to exit.
  8. SIGTERM, then SIGKILL remaining plugins.
  9. Close the storage connection (SQLite checkpoints WAL; Postgres releases the connection pool).
  10. Emit audit event daemon.shutdown with reason.

If the daemon receives a second SIGTERM during draining, it skips to step 6 (fast shutdown). If it receives SIGKILL, the kernel handles it and the daemon does no cleanup. This is expected to be recoverable: the next startup runs migrations and marks any in-progress runs as failed.

Phase 5 — stopped

Process exit. The init system observes the exit and decides whether to restart per its policy.


Subsystem dependency order

The daemon's subsystems start in a specific order during self_check → running:

logger
  ↓
config (loaded, validated)
  ↓
audit log writer (so subsequent events are captured)
  ↓
storage (db connection, migrations)
  ↓
network gatekeeper (in-process; sets up nftables rule templates)
  ↓
subagent supervisor (binds to storage; no children yet)
  ↓
plugin host + plugin supervisor (spawns initial plugin processes)
  ↓
session manager (binds to storage; no sessions active yet)
  ↓
HTTP+WS listener (last; opens the door)

Shutdown reverses the order. The HTTP listener stops accepting new connections first, then plugins, then subagents, then everything else, with the audit log writer last so it captures the shutdown of every other subsystem.

The reason this matters: the daemon never accepts a request it cannot service. If the storage layer isn't up, the listener isn't open.
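
The ordering guarantee can be sketched as an ordered start with reverse-order stop. The Subsystem interface and the startAll and stopAll functions are hypothetical, and rolling back already-started subsystems when a later one fails is an assumption of this sketch rather than documented behavior.

```typescript
interface Subsystem {
  name: string;
  start(): void;
  stop(): void;
}

// Start in dependency order; the listener goes last so no request arrives early.
function startAll(subsystems: Subsystem[]): Subsystem[] {
  const started: Subsystem[] = [];
  try {
    for (const s of subsystems) {
      s.start();
      started.push(s);
    }
  } catch (err) {
    stopAll(started); // assumption: unwind whatever already started
    throw err;
  }
  return started;
}

// Stop in the reverse of the start order.
function stopAll(started: Subsystem[]): void {
  for (const s of [...started].reverse()) s.stop();
}
```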


CLI surface

The kaged binary has two modes:

  1. Daemon mode: kaged start ... runs the long-lived process.
  2. Client mode: every other subcommand makes a local call to the running daemon (via its HTTP API, talking to 127.0.0.1 or the configured loopback bind).

The CLI is plumbing, not a workflow surface (per ADR-0002). Operators use it to start/stop the daemon, inspect state, manage plugins and DSL files, and emit the auth nonce. They do not use it for project work; that's the web UI.

kaged start

Run the daemon in the foreground.

kaged start [flags]

  --config <path>             Path to config.toml (default: mode-appropriate path; see Configuration file)
  --home <path>               Override ${KAGED_HOME}
  --bind <addr>               Override the listen address
  --insecure                  Bypass auth (per ADR-0007). LOUD WARNINGS.
  --no-sandbox                Disable sandboxing (per ADR-0009). LOUD WARNINGS.
  --insecure-bind             Allow non-loopback bind in secure mode. Required for non-loopback binds unless --insecure is set.
  --log-level <level>         debug | info | warn | error
  --foreground                Stay in foreground (default; here for documentation)

systemd unit files invoke kaged start with no --foreground quirk — the daemon is already foreground-only.

kaged status

Print daemon status: version, mode, bind, uptime, project/session counts, plugin status.

kaged status
  daemon: kaged 0.1.0 (pid 12345, up 3h 42m)
  bind:   127.0.0.1:7777
  auth:   secure (sidecar nonce loaded)
  sandbox: enabled (bwrap 0.8.0)
  storage: sqlite:///var/lib/kaged/kaged.db (ok)
  projects: 4 (3 valid, 1 invalid)
  plugins: 2 enabled (oh-my-pi: running, ollama: running)
  warnings: none

In insecure modes, the warnings line is populated and printed in magenta-equivalent terminal color.

kaged config show

Print the effective merged config (after all sources). Useful for debugging precedence.

kaged config show [--source]

--source annotates each value with where it came from (flag, env, file, default).

kaged config validate

Parse the config file and report errors without starting the daemon.

kaged auth nonce

Print the current sidecar nonce to stdout. Used by sidecar configuration tooling.

kaged auth nonce
  <printed to stdout — no trailing newline interpretation required>

Reads from nonce_file (or env). In --insecure mode, prints nothing and exits non-zero with a message that no nonce exists. Reading this requires read access to the nonce file; operators run it as the daemon user.

kaged auth rotate

Generate a new nonce, write it to nonce_file, and signal the running daemon to reload it.

kaged auth rotate
  ✓ new nonce written to /var/lib/kaged/auth-nonce
  ✓ daemon reloaded (SIGHUP)
  → reconfigure your sidecar with the new value

SIGHUP reloads the nonce and flushes cached local config (see Operational config vs local config). Nothing in config.toml is reloaded by SIGHUP; the rest of config requires a restart.

kaged auth open

Open the current launch URL in the operator's default browser. Used to authenticate a new browser session without copy-pasting the URL from daemon logs.

kaged auth open
  → opening http://127.0.0.1:38291/launch?token=abc123...

Reads the launch URL from the runtime state file ($XDG_RUNTIME_DIR/kaged/launch-url in user mode, ${KAGED_HOME}/launch-url in system mode). Calls xdg-open (Linux) or open (macOS) with the URL. No daemon API call is made — this is a pure file read + subprocess spawn.

Failure modes:

  • No launch-url file exists → exit non-zero with No running daemon found (missing launch-url file).
  • --insecure mode → exit non-zero with No launch URL in insecure mode (auth is disabled).
  • xdg-open / open not found → exit non-zero with Could not open browser: xdg-open not found.

The command does not consume the launch token — it merely opens the URL. The browser visit consumes it. Existing browser sessions are unaffected by token regeneration; the session cookie remains valid.

kaged plugin list / install / enable / disable / logs

Per ADR-0008:

  • kaged plugin list — table of installed plugins, status, last error.
  • kaged plugin install <path> — copy a plugin directory into plugins.dir, validate the manifest. Does not auto-enable.
  • kaged plugin enable <name> / disable <name> — toggle in config (writes to config.toml) and signals daemon.
  • kaged plugin logs <name> — tail stderr for the named plugin.

kaged dsl validate / migrate / schema

Per project-dsl.md CLI surface:

  • kaged dsl validate <path> — parse and validate a DSL file. Exits non-zero on failure.
  • kaged dsl migrate <path> --to <version> — schema-version migration.
  • kaged dsl schema [--version N] — print published JSON Schema.

These commands work without a running daemon (they're pure file operations); they don't make HTTP calls.

kaged backup and kaged restore

Per ADR-0005:

  • kaged backup [--output <path>] — produce a backup of the database, prompts, and projects. For SQLite, runs .dump against a consistent snapshot. For Postgres, runs pg_dump.
  • kaged restore <path> — restore from a backup. Refuses to run with a daemon active; the operator must stop the daemon first.

kaged audit

Tail or query the audit log directly (without going through the HTTP API).

kaged audit tail                   # follow
kaged audit query --since 1d       # last 24h
kaged audit query --event-type 'subagent.spawn.uncaged'

kaged version

Print version. The most boring command; included because every CLI needs it.

kaged help

Top-level help. Subcommand help via kaged <cmd> --help.


Logging

The daemon writes two logically distinct streams. Per ADR-0007 and the manifesto, the audit log is load-bearing for operator trust; the operational log is for debugging.

Operational log

Free-form structured logs. Default destination: stderr. Configurable to a file or journald.

  • Format: newline-delimited JSON ({"ts":..., "level":..., "msg":..., ...fields}) when destined for a file or journald; human-friendly text when destined for stderr in a TTY.
  • Levels: debug, info, warn, error. Default: info.
  • Contents: request lines, plugin spawn/exit, supervisor decisions, LLM-provider errors, daemon lifecycle events.
  • Not for audit. The operational log may be discarded, rotated by external tools, or sent to a remote collector. Nothing here is considered a record-of-truth.

Audit log

Append-only record of every load-bearing event.

  • Destination: file (audit.log in ${KAGED_HOME} by default). The audit log is always file-backed — never stderr only — because losing it on a process crash is unacceptable.
  • Format: newline-delimited JSON, one event per line. Field schema documented in http-api.md audit endpoint.
  • Append-only. The daemon never rewrites or deletes audit entries. Log rotation, if configured, archives old files but they remain readable.
  • fsync policy: the daemon fsyncs the audit log after every write. Slow but correct. Operators who want batched fsync can set logging.audit_sync = "interval:1s" in config (not recommended).
  • Event taxonomy (initial; extensible):
    • daemon.bootstrap, daemon.ready, daemon.shutdown, daemon.crash
    • auth.success, auth.failure, auth.insecure_mode, auth.nonce_rotated
    • sandbox.disabled
    • project.created, project.dsl_updated, project.deleted
    • session.created, session.attached, session.detached, session.ended
    • run.started, run.ended, run.cancelled
    • subagent.spawn, subagent.spawn.uncaged, subagent.exit, subagent.killed
    • checkpoint.created, checkpoint.resumed, checkpoint.rollback
    • prompt.edit
    • plugin.spawned, plugin.exit, plugin.crashed, plugin.enabled, plugin.disabled
    • policy.violation (any time a cage limit is hit or a request is denied)

Every audit event carries request_id when applicable, the operator's user_id (or insecure-mode), and a millisecond timestamp.
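
As an illustration of the one-event-per-line format, the sketch below serializes an event to a single NDJSON line. The auditLine function and the field names ts and event are assumptions; the authoritative field schema lives in http-api.md. Only request_id, user_id, and the millisecond timestamp are named in this spec.

```typescript
interface AuditEvent {
  ts: number;            // millisecond timestamp
  event: string;         // taxonomy name, e.g. "daemon.ready"
  user_id: string;       // operator id, or "insecure-mode"
  request_id?: string;   // present only when applicable
  [field: string]: unknown;
}

function auditLine(
  event: string,
  userId: string,
  fields: Record<string, unknown> = {},
  now: number = Date.now(),
  requestId?: string,
): string {
  // Core fields are written last so extra fields cannot overwrite them.
  const record: AuditEvent = { ...fields, ts: now, event, user_id: userId };
  if (requestId !== undefined) record.request_id = requestId;
  // JSON.stringify escapes embedded newlines, so each event stays on one line.
  return JSON.stringify(record) + "\n";
}
```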


Supervisor behavior

The daemon hosts named supervisors. Each owns a class of children.

PluginSupervisor

Owns plugin subprocesses (per ADR-0008).

  • Spawn: at daemon-ready, walks the enabled-plugins list, spawns each.
  • Restart policy: exponential backoff (1s, 2s, 4s, 8s, capped at 60s) on crash. After 5 consecutive failures within 10 minutes, the plugin is marked failed and disabled; operator re-enables via kaged plugin enable.
  • Health: the supervisor sends ping JSON-RPC every 30s. No response in 90s → kill and restart.
  • Shutdown: sends shutdown JSON-RPC; SIGTERM after shutdown_grace_sec; SIGKILL after shutdown_kill_sec.
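
The restart policy can be sketched with two small pure functions. The names backoffMs and shouldDisable are hypothetical; the constants come from the policy above. The sketch assumes the caller clears the crash history once a plugin has run healthily, which is how "consecutive" is interpreted here.

```typescript
const BASE_MS = 1000;            // first retry after 1s
const CAP_MS = 60000;            // backoff capped at 60s
const MAX_FAILURES = 5;          // consecutive failures before disabling
const WINDOW_MS = 10 * 60000;    // 10-minute window

// Delay before restart attempt n (0-based): 1s, 2s, 4s, 8s, ... capped at 60s.
function backoffMs(attempt: number): number {
  return Math.min(BASE_MS * 2 ** attempt, CAP_MS);
}

// True when the last MAX_FAILURES crashes all fall within the window ending at `now`.
// crashTimes holds millisecond timestamps of consecutive crashes, oldest first.
function shouldDisable(crashTimes: number[], now: number): boolean {
  const recent = crashTimes.slice(-MAX_FAILURES);
  return recent.length === MAX_FAILURES && now - recent[0] <= WINDOW_MS;
}
```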

SubagentSupervisor

Owns subagent invocations (per ADR-0009).

  • Spawn: on demand from the session manager. Compiles the cage policy from the DSL, sets up the network namespace and gatekeeper rules, then bwraps the subagent process. If cage: disabled or --no-sandbox, spawns as the daemon user directly.
  • No automatic restart. A failed subagent stays failed. The next operator message can retry.
  • Resource enforcement: cgroup limits applied at spawn; on limit breach, the supervisor kills and emits policy.violation.
  • Walltime: the supervisor enforces the walltime_sec from the DSL with its own timer. On expiry: SIGTERM, then SIGKILL after 5s.
  • Shutdown: during daemon draining, the supervisor SIGTERMs every live subagent and waits. Subagents that ignore the signal are SIGKILLed.

SessionSupervisor

Owns session lifecycle. Detailed semantics live in session-manager.md. At daemon level:

  • Sessions survive operator disconnects (per ADR-0002).
  • The daemon persists session state continuously (not just on shutdown). A SIGKILL of the daemon loses no committed work — any uncommitted in-flight reasoning is marked as a failed run on next startup.

Project loading

Per ADR-0011, projects are operator-owned directories tracked through the project registry in local config (local-config.md). The daemon does NOT discover projects by scanning the filesystem; it knows about a project only after the operator has explicitly loaded it.

What "loading a project" means

When the operator invokes POST /api/v1/projects/load (or kaged project load <path>):

  1. The daemon reads <path>/.kaged/project.yaml. If absent or unreadable → dsl_invalid error with details.reason: "no_project_yaml".
  2. The daemon validates the DSL (project-dsl.md). On failure → dsl_invalid with line/col details.
  3. The daemon resolves the calling operator's local config (per local-config.md).
  4. The daemon collects:
    • Every alias referenced by primary.model and subagents.<name>.model.
    • Every plugin in plugins.
    • Every prompt file referenced by *.system_prompt.
    • Every path referenced by cage.fs[].path.
  5. The daemon checks each against the operator's local config and the project directory:
    • Alias is in [aliases]? If not → pending, add to unresolved-aliases list.
    • Plugin registry entry's package is registered in the daemon's project-plugin supervisor? If not → pending, add the slot name plus { package, source, status: "missing" } to the missing-plugins list.
    • Prompt file exists at <project-root>/<path>? If not → pending, add to missing-prompts list.
  6. The daemon writes the project to the registry (or updates the existing entry) with state ready, pending, or invalid.
  7. The daemon returns the project status and the lists of unresolved items.
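
Steps 4 through 6 amount to a set-difference check. The following sketch assumes hypothetical type and function names (ProjectRefs, OperatorView, evaluateProject) and covers only the three checks listed in step 5; the invalid state comes from steps 1 and 2 and is not modelled here.

```typescript
// Hypothetical shapes for illustration; the real types live in the daemon source.
interface PluginRef { slot: string; package: string; source: string }
interface ProjectRefs { aliases: string[]; plugins: PluginRef[]; promptFiles: string[] }
interface OperatorView {
  aliases: Set<string>;            // names defined under [aliases] in local config
  registeredPackages: Set<string>; // packages known to the plugin supervisor
  promptsOnDisk: Set<string>;      // prompt paths that exist under the project root
}
interface LoadStatus {
  state: "ready" | "pending";
  unresolvedAliases: string[];
  missingPlugins: Array<PluginRef & { status: "missing" }>;
  missingPrompts: string[];
}

function evaluateProject(refs: ProjectRefs, view: OperatorView): LoadStatus {
  const unresolvedAliases = refs.aliases.filter((a) => !view.aliases.has(a));
  const missingPlugins = refs.plugins
    .filter((p) => !view.registeredPackages.has(p.package))
    .map((p) => ({ ...p, status: "missing" as const }));
  const missingPrompts = refs.promptFiles.filter((f) => !view.promptsOnDisk.has(f));
  // Any unresolved item leaves the project pending until the operator fixes it.
  const pending = unresolvedAliases.length + missingPlugins.length + missingPrompts.length > 0;
  return { state: pending ? "pending" : "ready", unresolvedAliases, missingPlugins, missingPrompts };
}
```

Because the function is pure, the same code can serve both the initial load and the re-evaluation triggers: the daemon only needs to rebuild the operator view and call it again.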

The UI then walks the operator through resolution: defining missing aliases, installing missing plugins (with the install prompt per ADR-0008 amendment), or asking the operator to fix missing prompts on disk.

Re-evaluation triggers

A project's state is recomputed when:

  • The operator edits local config (alias added, plugin installed, etc.) — the daemon recomputes the state of every registered project that mentioned the affected name.
  • The operator edits the project's DSL on disk (detected by mtime check on the next API call that references the project, or explicitly via POST /api/v1/projects/:id/dsl per http-api.md).
  • The daemon restarts. Every registered project is re-evaluated as part of the running phase entry.

State changes emit project.state_change audit events with the old and new state.

Hot-reload

DSL edits applied to a registered project hot-reload at the next session-start, not immediately. Active sessions continue with the DSL they were started under. The UI shows a "this project's DSL changed; restart sessions to pick up the change" indicator.

Hot-reload is intentionally conservative — a subagent mid-task should not have its cage policy change underneath it. Operators wanting immediate apply use the session UI's "restart with new DSL" action.

Loaded vs unloaded projects

A project on disk is just a project; once an operator has loaded it with kaged project load, it is a registered project for that operator. Sessions can only start against registered projects. The same project directory can be registered by multiple operators on a shared system-wide daemon: each operator gets an entry in their own local config, and although the underlying directory is shared, alias resolutions and project state are per-operator.

Unloading

DELETE /api/v1/projects/:id (per http-api.md) removes the project from this operator's registry and ends any active sessions for it. It does not delete files on disk. The operator can re-load it later.

A kaged project forget <id> CLI shorthand is equivalent.


Crash semantics

When a plugin crashes

  • The plugin host detects EOF on stdin/stdout.
  • PluginSupervisor records plugin.crashed with the exit code and last stderr.
  • Restart per backoff policy.
  • API calls that were mid-flight to the plugin return 502 provider_unreachable.
  • Other plugins are unaffected.
  • The daemon stays up.

When a subagent crashes

  • The supervisor detects the process exit.
  • The relevant run is marked failed. WS subscribers see subagent.end with non-zero exit.
  • The cage is torn down (network namespace cleaned, scratch dir wiped if ephemeral).
  • The primary may attempt to handle the failure per its prompt; the daemon does not auto-retry.
  • Other subagents and the daemon are unaffected.

When the daemon itself crashes

  • The init system (systemd) decides whether to restart. The blessed unit file sets Restart=on-failure.
  • On restart, the daemon runs self_check again. Existing data on disk is consistent (SQLite WAL guarantees atomicity for committed transactions).
  • Any subagents and plugins that were alive at crash time are orphaned. The next self_check cleans runtime/, which includes their PID files; the daemon does NOT kill orphaned processes on startup (they may have detached for legitimate reasons, like a deploy step the operator wants to outlive the daemon). Operators see them as "untracked processes" in kaged status and can clean them manually.
  • The audit log records daemon.crash (from the recovering instance, not the crashing one — the crashing instance is by definition unable to write its own crash event reliably).

When the host loses power

  • WAL replay on next SQLite open recovers committed transactions.
  • The audit log is fsync'd; the last committed line survives.
  • The daemon comes up via systemd; runs self_check; resumes.

systemd units

Two unit files ship as documentation. Operators install the one matching their deployment mode (per ADR-0010).

System-wide unit (/etc/systemd/system/kaged.service)

examples/deployment/systemd/kaged.service:

[Unit]
Description=kaged daemon
Documentation=https://kaged.dev
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=kaged
Group=kaged
EnvironmentFile=-/etc/kaged/env
ExecStart=/usr/local/bin/kaged start
Restart=on-failure
RestartSec=5s
KillMode=mixed
TimeoutStopSec=30s

# Hardening (systemd-side, complementing kaged's own sandbox)
NoNewPrivileges=yes
ProtectSystem=strict
ReadWritePaths=/var/lib/kaged
ProtectHome=yes
PrivateTmp=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
# The daemon needs cgroup access for subagent limits.
ProtectControlGroups=no
# The daemon needs these namespace types to set up cages.
RestrictNamespaces=user pid net mnt

[Install]
WantedBy=multi-user.target

Notes:

  • User=kaged runs the daemon as a non-root user. Sandbox features (user namespaces) work fine; the operator does not run kaged as root.
  • ReadWritePaths=/var/lib/kaged matches ${KAGED_HOME}.
  • RestrictNamespaces is permissive enough that the daemon can create namespaces for cages. Tightening this further breaks bwrap.
  • The OAuth sidecar (oauth2-proxy or equivalent) ships as a separate unit file (oauth2-proxy@kaged.service), out of scope for this spec.

launchd (macOS) and OpenRC (Alpine) equivalents will land in v0.x as documentation, not v0.

Per-user unit (~/.config/systemd/user/kaged.service)

examples/deployment/systemd-user/kaged.service:

[Unit]
Description=kaged daemon (per-user)
Documentation=https://kaged.dev
After=default.target

[Service]
Type=simple
EnvironmentFile=-%h/.config/kaged/env
ExecStart=%h/.local/bin/kaged start
Restart=on-failure
RestartSec=5s
KillMode=mixed
TimeoutStopSec=30s

# No system-level hardening directives are needed (the user's namespace
# already constrains the process). bwrap inside kaged handles cage isolation.

[Install]
WantedBy=default.target

Install/enable:

mkdir -p ~/.config/systemd/user
cp examples/deployment/systemd-user/kaged.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable --now kaged
# To make kaged run when the operator is logged out:
loginctl enable-linger "$USER"

Notes:

  • The per-user unit runs as the operator's UID. No User= or Group= directive; systemd handles it via the --user instance.
  • %h is systemd's expansion for the user's home directory.
  • Hardening directives (ProtectSystem, ReadWritePaths, etc.) are intentionally omitted from the user unit. The kernel's user-namespace boundary plus bwrap inside kaged is the isolation; replicating systemd-side hardening adds friction for limited benefit when the daemon is already constrained by being non-root.
  • loginctl enable-linger is required for the daemon to run when the operator is logged out (otherwise systemd stops the operator's user manager, and the daemon with it, when their last session ends). Documented; not hidden.
  • No sidecar is installed in this deployment. The daemon uses loopback + cookie-bound nonce auth (per ADR-0007 amendment).

Mixed deployments

An operator running a per-user kaged on their workstation AND interacting with a system-wide kaged on a homelab box is supported — they're independent daemons reachable at different addresses. The CLI can target a specific daemon via KAGED_BIND or --daemon; absent that, it talks to the per-user one if running, else system-wide. (Detailed CLI targeting rules are in CLI surface.)


Migrations

Database schema migrations are applied automatically during self_check.

  • Format: SQL files in packages/daemon/migrations/ named NNNN_description.sql, where NNNN is a zero-padded sequence number.
  • Engine portability: every migration has either one SQL file (when portable) or two (NNNN_description.sqlite.sql, NNNN_description.postgres.sql) per ADR-0005.
  • Tracking: a schema_migrations table records the applied migration IDs and timestamps.
  • Failure handling: on migration error, the daemon refuses to start and the migration's transaction is rolled back. The daemon does not partially apply migrations.

The daemon does not support down-migrations. To revert a schema, the operator restores from a backup made before the bad migration. This is documented as a feature, not a bug — automatic down-migrations are dangerous and we don't want operators to develop the reflex.
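
The naming scheme above is enough to select pending migrations for a given engine. The sketch below is illustrative only: pendingMigrations and the regular expression are hypothetical, and it assumes NNNN is exactly four digits and that descriptions use lowercase letters, digits, and underscores.

```typescript
type Engine = "sqlite" | "postgres";
interface Migration { id: string; description: string; file: string }

// NNNN_description.sql, or NNNN_description.<engine>.sql for engine-specific files.
const MIGRATION_RE = /^(\d{4})_([a-z0-9_]+)(?:\.(sqlite|postgres))?\.sql$/;

/** Returns unapplied migrations for `engine`, ordered by sequence number. */
function pendingMigrations(files: string[], engine: Engine, applied: Set<string>): Migration[] {
  const pending: Migration[] = [];
  for (const file of files) {
    const match = MIGRATION_RE.exec(file);
    if (!match) continue; // not a migration file
    const [, id, description, fileEngine] = match;
    if (fileEngine !== undefined && fileEngine !== engine) continue; // other engine's variant
    if (applied.has(id)) continue; // already recorded in schema_migrations
    pending.push({ id, description, file });
  }
  return pending.sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
}
```

Each returned migration would then run inside its own transaction, with the ID inserted into schema_migrations on success, so a failure leaves no partial state behind.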


Resource budgets (v0 defaults)

| Resource | Limit | Where enforced |
|---|---|---|
| Memory per subagent cage | 256 MB | cgroups (configurable via DSL cage.limits.memory_mb) |
| Walltime per subagent | 600s | supervisor timer |
| Concurrent subagents per session | 8 | session supervisor (rejects with 409 if exceeded) |
| Concurrent plugins | unlimited (within process FD limits) | none |
| Audit log file size | 100 MB before rotation | logger |
| Operational log file size | 50 MB before rotation | logger |
| WebSocket buffer per channel | per http-api.md | session manager |
| Database connections | 1 (SQLite), pool of 10 (Postgres) | storage layer |

These are operator-tunable in config.toml under their respective sections. The defaults are sized for a low-resource Linux host (e.g., 2 GB RAM).


Testing notes

Per ADR-0003:

  • Self-check tests: each gate (auth, bind, sandbox, storage, plugins, filesystem) has at least one test asserting the exit code and stderr message.
  • Config precedence tests: every overlap between sources is exercised — flag overrides env, env overrides file, file overrides default.
  • Lifecycle tests: the daemon starts, reaches running, drains on SIGTERM cleanly. Forced SIGKILL leaves the database consistent.
  • Supervisor tests: plugin crash → backoff and restart. Subagent crash → cage cleanup. Resource limit breach → policy.violation audit event.
  • Migration tests: every migration has a forward test against a fixture database. Portability tests run the same migrations against SQLite and Postgres CI containers.
  • CLI tests: every subcommand exercises a happy path and a documented failure mode.
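  • Audit log tests: every event type listed above is producible by an integration test. Durability is exercised by a crash simulator (kill -9 the daemon mid-write, restart, assert the last committed event is durable). Because kill -9 leaves the kernel page cache intact, this covers process crashes; it does not by itself validate the fsync guarantee against real power loss.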
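
For the config precedence tests, the rule "flag overrides env, env overrides file, file overrides default" reduces to a first-defined-wins lookup. The helper below is a hypothetical sketch (resolveSetting is not a name from the daemon source) of the behaviour those tests would assert.

```typescript
interface SettingSources<T> {
  flag?: T;    // value from a CLI flag, if given
  env?: T;     // value from an environment variable, if set
  file?: T;    // value from config.toml, if present
  fallback: T; // built-in default
}

/** Picks the highest-precedence source that is defined: flag > env > file > default. */
function resolveSetting<T>(sources: SettingSources<T>): T {
  return sources.flag ?? sources.env ?? sources.file ?? sources.fallback;
}
```

Note that the nullish operator treats an empty string as a set value; settings where empty means "use the default" would need an explicit emptiness check instead.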

Open questions

  1. Multi-tenant readiness. Today the daemon assumes one operator (or a trusted group authed through one sidecar). v2 will add per-operator scoping. The audit log already includes user_id; the rest is RBAC work.
  2. Cluster mode. Cross-daemon mesh (a kaged at home talking to a kaged in the office) was sketched in ADR-0001 and the vision doc. v0 is single-daemon; this spec doesn't preclude clustering but doesn't enable it either.
  3. Hot-reload of plugins. Today, enabling a plugin in config.toml requires a daemon restart. kaged plugin enable works without restart by signaling the daemon to spawn the plugin process; full config-driven hot-reload is deferred.
  4. Resource autoscaling. No automatic memory/walltime adjustment per workload. Operators tune defaults globally and per-cage. Reasonable for v0.
  5. macOS support. The daemon process model works fine on macOS (launchd, BSD kqueue). The blocker is the sandbox layer (bwrap is Linux-only). v0 is Linux-only; macOS is "kaged works in a Linux VM."

Amendments

2026-05-30 — Antigravity provider auth module

Per ADR-0028:

  1. New runtime module antigravity-auth/. The daemon gains an internal module at packages/daemon/src/runtime/antigravity-auth/ that owns the Antigravity provider's OAuth lifecycle: PKCE-based authorization code grant, token exchange, persistent token storage ($XDG_CONFIG_HOME/kaged/antigravity-tokens.json), proactive token refresh, and integration with the existing resolveCredentials() flow.
  2. Credential resolution extension. resolveCredentials() in primary-runner.ts now checks the Antigravity token store before falling back to local.toml access_token fields. Resolution order: daemon token store (fresh) → local.toml static token → null (unresolved).
  3. Three new HTTP endpoints. POST /login, GET /status, POST /logout under /api/v1/local/providers/antigravity/auth/. See http-api.md § Antigravity provider OAuth for the contract.
  4. Token store. Zod-validated JSON at $XDG_CONFIG_HOME/kaged/antigravity-tokens.json, mode 0600. Contains refresh token, access token, expiry, email, and project ID. Atomic writes via temp file + rename.
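
The resolution order in item 2 can be sketched as below. The function and type names are hypothetical, the real resolveCredentials() signature is not shown in this spec, and "fresh" is assumed here to mean "not yet expired".

```typescript
// Hypothetical subset of the token store contents relevant to resolution.
interface StoredToken { accessToken: string; expiresAtMs: number }

/**
 * Resolution order: daemon token store (fresh) -> local.toml static token -> null.
 * Assumption: a stored token counts as fresh while its expiry is in the future.
 */
function resolveAntigravityToken(
  stored: StoredToken | null,
  staticToken: string | undefined,
  nowMs: number,
): string | null {
  if (stored !== null && stored.expiresAtMs > nowMs) return stored.accessToken;
  if (staticToken) return staticToken;
  return null; // unresolved
}
```

With proactive refresh in place, the stale-token branch should be rare; it exists so a failed refresh degrades to the static token rather than to an error.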

2026-05-21 — Deployment modes + project-load flow + local-config split

Significant amendment driven by ADR-0010 and ADR-0011:

  1. New "Deployment mode" section before "Process model." Defines per-user vs system-wide mode detection at startup, the mode-determined defaults table, and the "what's identical across modes" guarantee.
  2. Filesystem layout split by mode. System mode still uses /var/lib/kaged for state and /etc/kaged/ for operational config; user mode uses XDG paths ($XDG_DATA_HOME/kaged for state, $XDG_CONFIG_HOME/kaged/ for config, $XDG_RUNTIME_DIR/kaged/ for the loopback auth cookie). Per-operator local configs added under ${KAGED_HOME}/local/ in system mode.
  3. New "Project loading" section before "Crash semantics." Defines the project-load endpoint flow (validate → resolve aliases → check plugins → register), state re-evaluation triggers, hot-reload conservatism, and unloading.
  4. Per-user systemd unit added alongside the existing system unit, with loginctl enable-linger documented for logged-out operation.
  5. "Operational config vs local config" subsection added to Configuration to disambiguate the two files and their loading semantics.

The spec is also now constrained by ADR-0010 and ADR-0011 (added to the frontmatter).

2026-05-24 — Streaming-first: events channel publishing, abort controller registry

Per ADR-0016:

  1. Events channel publishing from dispatchPrimary. The daemon now publishes lifecycle events on the events WebSocket channel — run.started when a run begins processing, run.ended (with outcome) when a run completes, fails, or is cancelled. These events trigger query cache invalidation in the UI so session state and message lists stay current without manual refresh. The ws-registry module gains a publishSessionEvent function alongside the existing publishHarnessEvent.
  2. Abort controller registry. The daemon maintains a per-run AbortController registry (activeRuns map in primary-runner.ts). When dispatchPrimary starts a run, it registers the controller; when the run ends (any outcome), it deregisters. The existing POST /sessions/:id/runs/:rid/cancel endpoint looks up the controller and calls .abort(), propagating cancellation through the harness to the LLM provider's SSE stream. This gives operators immediate abort capability.
  3. Message ordering. listMessages now orders by created_at ASC (previously id ASC). ULID lexicographic order and creation-time order can diverge when messages are created across async boundaries; created_at is the authoritative timeline.
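
The abort controller registry in item 2 is a small pattern; a minimal sketch follows. Only the activeRuns map name comes from this amendment; registerRun, deregisterRun, and cancelRun are hypothetical helper names.

```typescript
// One AbortController per in-flight run, keyed by run ID.
const activeRuns = new Map<string, AbortController>();

/** Called when a run starts; the returned signal is passed down to the harness. */
function registerRun(runId: string): AbortSignal {
  const controller = new AbortController();
  activeRuns.set(runId, controller);
  return controller.signal;
}

/** Called when a run ends with any outcome. */
function deregisterRun(runId: string): void {
  activeRuns.delete(runId);
}

/** Called by the cancel endpoint; returns false if the run is not active. */
function cancelRun(runId: string): boolean {
  const controller = activeRuns.get(runId);
  if (controller === undefined) return false;
  controller.abort();
  return true;
}
```

Passing the signal into the provider's fetch call is what lets cancellation reach the SSE stream; deregistering in a finally block keeps the map from leaking entries when a run throws.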

2026-05-23 — Per-session model override dispatch

  1. Model override in dispatch path. dispatchPrimary now checks session.modelOverride before alias resolution. When set, it splits the override ("provider:model") into provider name and model ID, resolves the provider's credentials from local config, and constructs the ProviderRoute directly — bypassing alias lookup entirely. When modelOverride is null, the existing alias resolution path is used unchanged.
  2. Override persistence via handlePostMessage. When POST /sessions/:id/messages includes a model_override field, the session record is updated with the override before dispatch begins. This makes the override sticky — subsequent messages use it until changed or cleared.
  3. Override persistence via handleUpdateSession. PUT /sessions/:id now accepts model_override alongside label. The operator (or UI) can set, change, or clear (null) the override without posting a message.
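
Splitting the "provider:model" override might look like the sketch below. The function name is hypothetical, and splitting at the first colon (so that model IDs may themselves contain colons) is an assumption rather than something this amendment states.

```typescript
interface ModelOverride { provider: string; model: string }

/**
 * Parses a "provider:model" override. Returns null when no override is set,
 * in which case the caller falls back to alias resolution.
 * Assumption: the split happens at the FIRST colon only.
 */
function parseModelOverride(override: string | null): ModelOverride | null {
  if (override === null) return null;
  const idx = override.indexOf(":");
  if (idx <= 0 || idx === override.length - 1) {
    throw new Error(`invalid model override: ${override}`);
  }
  return { provider: override.slice(0, idx), model: override.slice(idx + 1) };
}
```

Rejecting malformed values at parse time keeps a bad override from being persisted on the session and failing later during dispatch.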

2026-05-22 — UI URL configuration + launch token regeneration

  1. New ui.url config key added to the [ui] section. Specifies the base URL where the web UI is reachable, used to construct launch URLs in loopback mode. When the UI runs on a separate origin (dev server, tunnel, reverse proxy), the operator sets this to the UI's origin. When empty (default), the daemon uses its own bind address — the correct default when the daemon serves the UI bundle itself.
  2. New KAGED_UI_URL env var added to the env var table. Overrides ui.url per standard precedence (env > config > default).
  3. Launch URL uses the UI base URL. The launch URL printed at startup is {ui_base_url}/launch?token=<token>, where ui_base_url is resolved from KAGED_UI_URL > ui.url > http://{bind}. This points to the UI's /launch route, which handles the token exchange via JSON content negotiation with the daemon's API. The operator's browser must reach the UI origin, not the daemon directly.
  4. Launch token regeneration after invalidation. When the one-time launch token is consumed (operator visits the launch URL), the daemon generates a new token and logs a new launch URL to the operational log. This ensures the operator can always re-authenticate from a new browser without restarting the daemon. The previous session cookie remains valid; the new token is for new browser sessions only.
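
The base-URL precedence in item 3 (KAGED_UI_URL > ui.url > http://{bind}) can be illustrated as follows. Function names are hypothetical; stripping trailing slashes and URL-encoding the token are additions made for the sketch, not behaviour stated in this amendment.

```typescript
/** Resolves the UI base URL: env var, then config key, then the daemon's own bind address. */
function resolveUiBaseUrl(envUiUrl: string | undefined, configUiUrl: string | undefined, bind: string): string {
  // Empty strings count as unset, matching "when empty (default)".
  const chosen = envUiUrl || configUiUrl || `http://${bind}`;
  return chosen.replace(/\/+$/, ""); // assumption: avoid a double slash before /launch
}

/** Builds the launch URL pointing at the UI's /launch route. */
function buildLaunchUrl(uiBaseUrl: string, token: string): string {
  return `${uiBaseUrl}/launch?token=${encodeURIComponent(token)}`;
}
```

For example, with no overrides and a bind address of 127.0.0.1:7777 (an illustrative port), the printed URL would start with http://127.0.0.1:7777/launch?token=.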

2026-05-22 — Config auto-creation, home-relative paths, launch URL fix

  1. Config auto-creation on first run. When no config file exists at the mode-appropriate default path and no --config/KAGED_CONFIG override is set, the daemon creates a config file with mode-appropriate defaults (home-relative paths for storage, audit log, plugins dir, and the correct bind address). Logs Config created: <path>. No more silent defaults — every daemon run has an explicit, editable config file.
  2. Home-relative path defaults. storage.url, logging.audit, and plugins.dir now default relative to ${KAGED_HOME} instead of hardcoding /var/lib/kaged. When these fields are empty (or absent) in the config file, they resolve to sqlite://${home}/kaged.db, file:${home}/audit.log, and ${home}/plugins respectively. Explicitly set values are never overwritten.
  3. Bootstrap phase restructured. Mode is now resolved before config file discovery (from flags/env only). This eliminates the chicken-and-egg problem where the config file path depends on mode but mode might depend on config. The config file is loaded (or created) after mode and home are known.
  4. Launch URL path fixed. Launch URLs now point to /launch?token=<token> (the UI route) instead of /api/v1/launch?token=<token> (the daemon API endpoint). The UI's /launch route handles token exchange via JSON content negotiation with the daemon's API — the operator's browser should never hit the daemon's API directly.
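
The home-relative defaults in item 2 can be sketched as a single fill-in step. The PathConfig shape and applyHomeDefaults name are hypothetical; the three default patterns are the ones given above.

```typescript
// Hypothetical view of the three path-bearing config fields.
interface PathConfig { storageUrl?: string; auditLog?: string; pluginsDir?: string }

/** Fills empty or absent fields relative to KAGED_HOME; explicit values are kept untouched. */
function applyHomeDefaults(cfg: PathConfig, home: string): Required<PathConfig> {
  return {
    storageUrl: cfg.storageUrl || `sqlite://${home}/kaged.db`,
    auditLog: cfg.auditLog || `file:${home}/audit.log`,
    pluginsDir: cfg.pluginsDir || `${home}/plugins`,
  };
}
```

Running this only after mode and home are resolved is what the restructured bootstrap phase in item 3 makes possible.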

2026-05-22 — Runtime state files, kaged auth nonce, kaged auth open

  1. Nonce file written at startup. Per ADR-0007 amendment, the daemon now writes the per-session nonce to a file at startup. In user mode: $XDG_RUNTIME_DIR/kaged/auth-cookie (mode 0600). In system mode: ${KAGED_HOME}/auth-nonce (mode 0600). The nonce is generated once per daemon lifetime and does not change when launch tokens are regenerated. The file path is logged to stderr: Nonce written: <path>.
  2. Launch URL file written at startup and on regeneration. The daemon writes the current launch URL to a file alongside the nonce: $XDG_RUNTIME_DIR/kaged/launch-url (user mode) or ${KAGED_HOME}/launch-url (system mode), mode 0600. The file is rewritten whenever a launch token is consumed and regenerated. CLI commands read this file directly — no daemon API call required.
  3. kaged auth nonce implemented. Reads the nonce directly from the nonce file (no API call). Prints to stdout. Exits non-zero in insecure mode. Per the existing CLI surface spec.
  4. kaged auth open added. New CLI command that reads the launch URL from the runtime state file and opens it in the operator's default browser via xdg-open (Linux) or open (macOS). No API call. Exits non-zero if no launch-url file exists or in insecure mode.
  5. CLI subcommand routing expanded. The daemon binary now dispatches kaged auth <subcommand> in addition to kaged start. The auth subcommands (nonce, open) are pure file reads + local actions — they do not require a running daemon's HTTP API.
  6. Filesystem layout updated. launch-url added to both user-mode ($XDG_RUNTIME_DIR/kaged/) and system-mode (${KAGED_HOME}/) layouts. Both files are mode 0600.
  7. Runtime directory ownership check. The daemon creates $XDG_RUNTIME_DIR/kaged/ (mode 0700) if it does not exist. If the directory exists but is not owned by the daemon user, the daemon refuses to start.
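
The mode-dependent locations of the two runtime state files (items 1, 2, and 6) can be summarised in one lookup. This is a sketch with hypothetical names (runtimeStatePaths, STATE_FILE_MODE); the paths themselves are the ones listed above.

```typescript
type Mode = "user" | "system";
interface StateDirs { xdgRuntimeDir: string; kagedHome: string }
interface StatePaths { nonce: string; launchUrl: string }

// Both files are written with mode 0600.
const STATE_FILE_MODE = 0o600;

/** Returns where the nonce and launch-URL files live for the given deployment mode. */
function runtimeStatePaths(mode: Mode, dirs: StateDirs): StatePaths {
  if (mode === "user") {
    const base = `${dirs.xdgRuntimeDir}/kaged`;
    return { nonce: `${base}/auth-cookie`, launchUrl: `${base}/launch-url` };
  }
  return { nonce: `${dirs.kagedHome}/auth-nonce`, launchUrl: `${dirs.kagedHome}/launch-url` };
}
```

Because kaged auth nonce and kaged auth open read these files directly, the CLI and the daemon would need to share this lookup to stay in agreement.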

References

  • ADR-0001 — kaged as lifecycle root
  • ADR-0002 — web UI is bundled and served by the daemon
  • ADR-0004 — Bun runtime, bun build --compile
  • ADR-0005 — SQLite default, Postgres opt-in, portable migrations
  • ADR-0007 and its amendment — sidecar contract, --insecure
  • ADR-0008 — plugin host subprocess model
  • ADR-0009 and its amendment — sandbox enforcement, --no-sandbox
  • ADR-0028 — Antigravity provider OAuth lifecycle
  • http-api.md — the surface this daemon exposes
  • project-dsl.md — the contract this daemon parses
  • session-manager.md — internal session state machine
  • sandbox.md — cage compiler and network gatekeeper
  • plugin-host.md — plugin JSON-RPC protocol