ADR-0029: Structured operational logging

  • Status: Proposed
  • Date: 2026-05-31
  • Deciders: @karasu
  • Supersedes:
  • Superseded by:

Context

kaged has three logging-shaped things that don't talk to each other:

  1. @kaged/utils/logger.ts — a JSON-structured, daily-rotating file logger. Supports levels (debug, info, warn, error), configurable retention, console mirroring. Exists, but the daemon does not use it — every daemon module writes raw console.error() to stderr instead (42 calls across main.ts and task-recovery.ts).

  2. Audit log (/api/v1/audit) — an append-only event trail stored in SQLite (audit_events table). Covers auth events, plugin installs, compaction lifecycle, prompt edits. Queried by the UI's audit screen. This is not operational logging — it answers "what happened?" at the policy level, not "what went wrong?" at the runtime level.

  3. UI log drawer — a slide-up panel with filter chips (daemon, session, subagent, audit) and a LogEntry type. Currently empty at runtime — the drawer renders const entries: LogEntry[] = [] because there is no API endpoint feeding it. The component structure is ready; the data pipeline is not.

The pain

When compaction fails with {"error":{"code":"internal","message":"Failed to run compaction."}}, there is no way for the operator to see why without SSH-ing into the host and reading tmux output. The error was swallowed because the daemon has no structured logging — only scattered console.error calls that vanish into the terminal scrollback.

ADR-0013 says "kaged ships sane defaults if Langfuse is not configured: structured logs to stdout (and to the kaged log viewer in the UI), enough to debug most runs without external infra." This commitment exists on paper but not in code. This ADR delivers it.

What this ADR is not

  • Not a replacement for Langfuse tracing (ADR-0013). Operational logs answer "what is the daemon doing / what went wrong?" Langfuse answers "what did the model do, with what tokens, at what cost?" Different data, different consumers, different retention.
  • Not about audit logging. Audit events are policy-level (who did what). Operational logs are runtime-level (what happened inside the daemon). Both coexist.
  • Not about UI emitting logs. The UI is a consumer of logs via the log drawer. UI-side log emission is deferred.

Decision

The daemon emits structured operational logs to two sinks: SQLite (queryable by the UI via HTTP API) and rotating flat files (survivable across DB resets, grep-friendly). The existing @kaged/utils/logger.ts is adopted as the file sink. A new logs table in SQLite provides the queryable sink. Plugins log through a structured log method on their JSON-RPC channel; the daemon captures plugin stderr as unstructured daemon logs with plugin context. Retention and level are configurable in local.toml.

Sink 1: SQLite logs table

Operational logs go into a logs table alongside sessions, runs, and audit events. This gives the UI a paginated, filterable, text-searchable log source without extra infrastructure.

CREATE TABLE IF NOT EXISTS logs (
  id          TEXT PRIMARY KEY,    -- ULID
  ts          INTEGER NOT NULL,   -- epoch ms
  level       TEXT NOT NULL,      -- debug | info | warn | error
  source      TEXT NOT NULL,      -- daemon | plugin | session | subagent
  message     TEXT NOT NULL,
  project_id  TEXT,               -- nullable: daemon-level logs have no project
  session_id  TEXT,               -- nullable: non-session logs have no session
  context     TEXT,               -- JSON blob for structured fields
  plugin_name TEXT                -- nullable: set for plugin logs and for captured plugin stderr
);

Index on (level, ts) for filtered queries, and (project_id, ts) for project-scoped views.

Retention: the daemon prunes rows older than the configured retention window on boot (as a background task that does not block startup) and periodically while it runs. Defaults: 7 days or 10,000 rows, whichever limit is hit first.
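
A minimal sketch of the pruning rule, written over an in-memory array so the logic is easy to check. The LogRow shape and the selectPrunable name are illustrative assumptions, not part of the storage layer; against the real table the same rule would presumably be expressed as DELETE statements.

```typescript
// Hypothetical row shape: only the columns the pruning rule needs.
interface LogRow {
  id: string; // ULID
  ts: number; // epoch ms
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the ids to delete: rows older than the retention window, plus the
// oldest remaining rows beyond maxEntries.
function selectPrunable(
  rows: LogRow[],
  nowMs: number,
  retentionDays: number = 7,
  maxEntries: number = 10_000,
): string[] {
  const cutoff = nowMs - retentionDays * DAY_MS;
  const expired = rows.filter((r) => r.ts < cutoff);
  // Sort survivors newest first so everything past maxEntries is the oldest overflow.
  const fresh = rows.filter((r) => r.ts >= cutoff).sort((a, b) => b.ts - a.ts);
  const overflow = fresh.slice(maxEntries);
  return [...expired, ...overflow].map((r) => r.id);
}
```

Applying both limits in one pass means a quiet daemon is bounded by age and a noisy one by row count, which matches the "whichever limit is hit first" wording.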

Sink 2: Rotating flat files (existing @kaged/utils/logger.ts)

Adopt the existing file logger as-is. The daemon configures it on boot from local.toml [logging] section. File logs are the survival copy — they outlive DB resets, can be grep'd, and can be shipped to external collectors (Loki, Datadog) by the operator.

Both sinks receive the same entries. File logs are the authoritative record; SQLite is the queryable index for the UI.
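
The fan-out to both sinks could look roughly like the sketch below. The LogSink interface, the DualSinkLogger class, and the choice to isolate sink failures from each other are assumptions made for illustration; the existing file logger and the SQLite writer would each need a thin adapter to fit this shape.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogEntry {
  ts: number; // epoch ms
  level: LogLevel;
  source: string;
  message: string;
  context?: Record<string, unknown>;
}

// Hypothetical sink contract shared by the file sink and the SQLite sink.
interface LogSink {
  name: string;
  write(entry: LogEntry): void;
}

class DualSinkLogger {
  private readonly sinks: LogSink[];
  private readonly onSinkError: (sinkName: string, err: unknown) => void;

  constructor(
    sinks: LogSink[],
    onSinkError: (sinkName: string, err: unknown) => void = () => {},
  ) {
    this.sinks = sinks;
    this.onSinkError = onSinkError;
  }

  // Sends the same entry to every sink.
  log(entry: LogEntry): void {
    for (const sink of this.sinks) {
      try {
        sink.write(entry);
      } catch (err) {
        // Assumption: a failing sink must not stop the others from receiving the entry.
        this.onSinkError(sink.name, err);
      }
    }
  }
}
```

Isolating failures per sink is what lets the flat files keep recording when the database is locked or corrupted.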

Levels and defaults

| Level | Meaning | Production default | Development default |
|-------|---------|--------------------|---------------------|
| error | Something failed, operator action may be needed | Always on | Always on |
| warn  | Something unexpected but recovered | Always on | Always on |
| info  | Normal operational events (startup, shutdown, plugin loaded, session created) | Off (below the default minimum of warn) | On |
| debug | Detailed internals (hook firing, tool registration, context resolution) | Off | On |

Production default minimum level: warn (7 days, 10k entries). Development default minimum level: debug (7 days, 50k entries) — bun test is not affected; this is the running daemon's log level.

The daemon reads NODE_ENV or KAGED_ENV. If the value is "development", the development defaults apply and debug is enabled; otherwise the production defaults apply. The operator can override either set of defaults via local.toml.
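
A sketch of how the environment check and the minimum-level filter might fit together. The function names are hypothetical, and the precedence of KAGED_ENV over NODE_ENV is an assumption: the ADR does not say which variable wins when both are set.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Numeric rank so levels can be compared against the configured minimum.
const LEVEL_ORDER: Record<LogLevel, number> = { debug: 0, info: 1, warn: 2, error: 3 };

interface EnvDefaults {
  level: LogLevel;
  retentionDays: number;
  maxEntries: number;
}

function defaultsForEnv(env: Record<string, string | undefined>): EnvDefaults {
  // Assumption: KAGED_ENV takes precedence over NODE_ENV when both are present.
  const mode = env.KAGED_ENV ?? env.NODE_ENV;
  return mode === "development"
    ? { level: "debug", retentionDays: 7, maxEntries: 50_000 }
    : { level: "warn", retentionDays: 7, maxEntries: 10_000 };
}

// True when an entry at `level` passes the configured minimum level.
function shouldLog(level: LogLevel, minLevel: LogLevel): boolean {
  return LEVEL_ORDER[level] >= LEVEL_ORDER[minLevel];
}
```

With the production default of warn, shouldLog("info", "warn") is false, so info entries are dropped unless the operator lowers the level in local.toml.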

Sources (log categories)

The source field maps to the UI's existing LogFilterKind. Existing kinds are retained and extended:

| Source | What emits it | UI filter chip |
|--------|---------------|----------------|
| daemon | Daemon core: startup, shutdown, gates, config loading, internal errors. Also carries captured plugin stderr, with plugin_name set. | daemon |
| plugin | Plugin lifecycle: load, hook firing, tool registration, errors, plus entries sent through the structured log method. | daemon (or a future plugin chip) |
| session | Session lifecycle: create, state transitions, compaction, idle | session |
| subagent | Subagent invocations: spawn, cage setup, exit, errors | subagent |
| audit | Audit events (already served by /api/v1/audit) | audit |

The audit source is special: audit events continue to flow through the existing /api/v1/audit endpoint and the audit_events table, which keeps its own schema. The log viewer can show them alongside operational logs in one of two ways: the logs table stores lightweight source: "audit" references (not duplicates), or the UI queries both tables and merges the results by timestamp. The spec will settle this.

Plugin logging

Project plugins (subprocesses, ADR-0008) get a structured log method in their JSON-RPC protocol:

{
  "jsonrpc": "2.0",
  "method": "log",
  "params": {
    "level": "error",
    "message": "Failed to preserve messages during compaction",
    "context": { "compaction_id": "01JX...", "retained_count": 0 }
  }
}

The daemon writes these to both sinks with source: "plugin" and plugin_name set.

Plugin stderr is captured line-by-line and written as source: "daemon" logs with plugin_name set and context.capture: "stderr". This catches unstructured errors from plugin processes without the plugin needing to use the structured protocol.

System plugins (in-process, @kaged/plugin-types) already have PluginLogger in their context. That interface is wired to the same dual-sink pipeline.
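
For the subprocess path, the two capture routes described above (structured log notifications and raw stderr) might be normalized as in the sketch below. The function names are hypothetical, the level assigned to stderr lines is an assumption because the ADR does not fix one, and a real stream reader would also need to buffer partial lines across chunks.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";
const LEVELS: readonly string[] = ["debug", "info", "warn", "error"];

interface PluginLogEntry {
  level: LogLevel;
  source: "daemon" | "plugin";
  message: string;
  plugin_name: string;
  context?: Record<string, unknown>;
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isLevel(v: unknown): v is LogLevel {
  return typeof v === "string" && LEVELS.includes(v);
}

// Converts one JSON-RPC message into a log entry, or returns null when the
// message is not a well-formed "log" notification.
function fromLogNotification(pluginName: string, raw: string): PluginLogEntry | null {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!isRecord(msg) || msg.jsonrpc !== "2.0" || msg.method !== "log") return null;
  const params = msg.params;
  if (!isRecord(params) || !isLevel(params.level) || typeof params.message !== "string") {
    return null;
  }
  const entry: PluginLogEntry = {
    level: params.level,
    source: "plugin",
    message: params.message,
    plugin_name: pluginName,
  };
  if (isRecord(params.context)) entry.context = params.context;
  return entry;
}

// Turns a captured stderr chunk into one entry per non-empty line.
function fromStderr(pluginName: string, chunk: string): PluginLogEntry[] {
  return chunk
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => ({
      level: "error" as const, // Assumption: the ADR does not specify a level for stderr lines.
      source: "daemon" as const,
      message: line,
      plugin_name: pluginName,
      context: { capture: "stderr" },
    }));
}
```

Returning null for malformed notifications keeps a misbehaving plugin from corrupting the log pipeline; whether such messages should instead be logged as warnings is left to the spec.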

Configuration in local.toml

New [logging] section:

[logging]
level = "warn"           # minimum level: debug | info | warn | error
retention_days = 7       # prune logs older than this
max_entries = 10000      # prune oldest rows when exceeded
dir = "/var/log/kaged"   # override file log directory (optional)
console = true           # mirror to stderr (default: false in production)

All fields optional. Defaults applied when absent.
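
One way the defaults could be applied, sketched with the production values stated in this ADR. The resolveLoggingConfig name and the decision to throw on invalid values are assumptions; the real validation would presumably live in LocalConfigSchema. The dir field has no default here because the ADR only describes it as an optional override.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";
const LEVELS: readonly string[] = ["debug", "info", "warn", "error"];

interface LoggingConfig {
  level: LogLevel;
  retention_days: number;
  max_entries: number;
  dir?: string;
  console: boolean;
}

const PRODUCTION_DEFAULTS: LoggingConfig = {
  level: "warn",
  retention_days: 7,
  max_entries: 10_000,
  console: false,
};

function isPositiveInt(v: unknown): v is number {
  return typeof v === "number" && Number.isInteger(v) && v > 0;
}

// Merges a parsed [logging] table over the defaults, rejecting invalid values.
function resolveLoggingConfig(raw: Record<string, unknown> = {}): LoggingConfig {
  const cfg: LoggingConfig = { ...PRODUCTION_DEFAULTS };
  const { level, retention_days, max_entries, dir, console: mirror } = raw;
  if (level !== undefined) {
    if (typeof level !== "string" || !LEVELS.includes(level)) {
      throw new Error(`logging.level must be one of ${LEVELS.join(", ")}`);
    }
    cfg.level = level as LogLevel;
  }
  if (retention_days !== undefined) {
    if (!isPositiveInt(retention_days)) throw new Error("logging.retention_days must be a positive integer");
    cfg.retention_days = retention_days;
  }
  if (max_entries !== undefined) {
    if (!isPositiveInt(max_entries)) throw new Error("logging.max_entries must be a positive integer");
    cfg.max_entries = max_entries;
  }
  if (dir !== undefined) {
    if (typeof dir !== "string" || dir === "") throw new Error("logging.dir must be a non-empty string");
    cfg.dir = dir;
  }
  if (mirror !== undefined) {
    if (typeof mirror !== "boolean") throw new Error("logging.console must be a boolean");
    cfg.console = mirror;
  }
  return cfg;
}
```

Failing loudly on a bad value is a design choice in this sketch: a silently ignored level = "verbose" would leave the operator wondering why debug output never appears.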

HTTP API

New endpoints for the log drawer:

  • GET /api/v1/logs — global daemon logs (no project scope)
  • GET /api/v1/projects/:id/logs — project-scoped logs
  • GET /api/v1/sessions/:id/logs — session-scoped logs

Query parameters: level, source, since (epoch ms), until (epoch ms), q (string search on message), limit (default 100, max 500), cursor (ULID-based pagination).

Response: { "entries": LogEntry[], "cursor": string | null } — most recent first, cursor-based pagination for the UI's "load more on scroll" pattern.
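
A sketch of the limit clamping and cursor logic, using an in-memory array in place of the logs table. It assumes the cursor is the id of the last entry on the previous page and relies on ULIDs sorting lexicographically in creation order; the ADR only says pagination is ULID-based, so the exact cursor semantics are an assumption, as are the function names.

```typescript
interface LogEntry {
  id: string; // ULID
  ts: number; // epoch ms
  level: string;
  source: string;
  message: string;
}

interface LogPage {
  entries: LogEntry[];
  cursor: string | null;
}

const DEFAULT_LIMIT = 100;
const MAX_LIMIT = 500;

// Parses the raw `limit` query value, falling back to the default and capping at the maximum.
function clampLimit(raw: string | null): number {
  const n = raw === null ? Number.NaN : Number.parseInt(raw, 10);
  if (!Number.isFinite(n) || n < 1) return DEFAULT_LIMIT;
  return Math.min(n, MAX_LIMIT);
}

// Returns one page, most recent first, plus the cursor for the next page (null when exhausted).
function pageLogs(all: LogEntry[], limit: number, cursor: string | null): LogPage {
  // Descending id order is "most recent first" because ULIDs sort by creation time.
  const sorted = [...all].sort((a, b) => (a.id < b.id ? 1 : a.id > b.id ? -1 : 0));
  const remaining = cursor === null ? sorted : sorted.filter((e) => e.id < cursor);
  const entries = remaining.slice(0, limit);
  const last = entries[entries.length - 1];
  const hasMore = remaining.length > entries.length;
  return { entries, cursor: hasMore && last ? last.id : null };
}
```

Keying the cursor on the id rather than an offset keeps pages stable while new entries arrive at the top, which suits the drawer's load-more-on-scroll pattern.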

UI log drawer behavior

  • Opening the drawer requests the last N entries (via limit) for the current scope (project or session). Most recent at the top — no initial scroll.
  • Scrolling to the bottom triggers a load-more request using the cursor from the previous response.
  • Adding/changing a filter re-requests with the filter applied, maintaining the N-entry window.
  • String search is server-side (q parameter) — the UI sends the query, the daemon does LIKE or FTS on message.
  • Real-time updates: initially request/response. A future iteration can add a WebSocket subscription for live log tailing. Not in scope for this ADR.
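
On the UI side, building the request for a given scope and filter set might look like the sketch below. The LogScope and LogFilters types and the buildLogsUrl helper are hypothetical; the paths and parameter names follow the HTTP API section.

```typescript
type LogScope =
  | { kind: "global" }
  | { kind: "project"; id: string }
  | { kind: "session"; id: string };

type LogFilters = {
  level?: string;
  source?: string;
  since?: number; // epoch ms
  until?: number; // epoch ms
  q?: string;
  limit?: number;
  cursor?: string;
};

// Builds the request URL for the drawer's current scope and filters.
function buildLogsUrl(scope: LogScope, filters: LogFilters = {}): string {
  const base =
    scope.kind === "global"
      ? "/api/v1/logs"
      : `/api/v1/${scope.kind}s/${encodeURIComponent(scope.id)}/logs`;
  const parts: string[] = [];
  for (const [key, value] of Object.entries(filters)) {
    // Skip unset filters so the daemon applies its own defaults.
    if (value === undefined || value === "") continue;
    parts.push(`${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`);
  }
  return parts.length > 0 ? `${base}?${parts.join("&")}` : base;
}
```

A load-more request reuses the same filters and adds the cursor from the previous response; changing a filter drops the cursor and starts again from the newest entry.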

Consequences

What this commits us to

  • Migrating all 42 console.error calls in the daemon to structured logger calls. Mechanical but noisy.
  • A logs table in the storage schema (bumps SCHEMA_VERSION).
  • A [logging] section in local.toml schema (extends LocalConfigSchema).
  • Three new HTTP endpoints in the daemon.
  • The UI log drawer stops being a placeholder and starts fetching real data.
  • Plugin JSON-RPC protocol gains a log notification method.
  • The existing @kaged/utils/logger.ts gets adopted into the daemon's startup path.

What this forecloses

  • No third-party log sinks in this ADR. The operator can point external tools at the flat files or the SQLite DB. A future plugin could add Loki/Elasticsearch forwarding, but that's not spec'd here.
  • No log forwarding to Langfuse. Langfuse is for LLM traces (ADR-0013). Operational logs are a separate concern.
  • No UI log emission. The UI is a consumer only. Frontend errors go to the browser console, not to the daemon's log pipeline.

What becomes easier

  • Debugging runtime errors without SSH/tmux. The operator opens the log drawer, filters to error, and sees the actual failure message with context.
  • Plugin debugging. Plugin authors can emit structured logs that appear in the same drawer as daemon logs, filterable by source.
  • Traceability for operational events. "Did the plugin fire?" is answerable from the log, not just from the compaction result.
  • External integration. Flat files are grep-friendly, shippable to any log collector the operator already runs.

What becomes harder

  • Storage growth. The logs table needs retention enforcement. The daemon prunes on boot and periodically (configurable interval). The operator must be aware that increasing max_entries or retention_days increases DB size.
  • Migration. All console.error calls need converting. It's mechanical but it touches many files.

Alternatives considered

Alternative A — File logs only, no SQLite

Why tempting: Simpler. No schema change, no new endpoint. Operator greps files.

Why rejected: The UI log drawer needs paginated, filterable, searchable access to logs. Implementing that over flat files means re-implementing a query engine. SQLite already does this. The whole point is to make the UI drawer work.

Alternative B — SQLite only, no flat files

Why tempting: Single source of truth. No dual-write complexity.

Why rejected: Flat files survive DB corruption, are grep-friendly, and are the standard interface for external log collectors. A DROP TABLE logs or a corrupted SQLite file should not be the only way to lose operational history. Dual-sink is worth the minor write overhead.

Alternative C — Use the existing StructuredLogEntry from @kaged/harness

Why tempting: Type already exists in packages/harness/src/types.ts.

Why rejected: That type is harness-scoped (level: "info" | "warn" | "error", no debug, no source, no project_id). The operational log schema needs broader fields. Extend rather than conflate.

Alternative D — Winston / Pino / Bunyan

Why tempting: Battle-tested, feature-rich.

Why rejected: ADR-0004 mandates Bun-native runtime. @kaged/utils/logger.ts already implements the core features (levels, rotation, JSON structured output) using Bun built-ins. Adding an npm logging dependency contradicts the "Bun built-ins first" posture for something this fundamental.

References

  • ADR-0004 — Runtime is Bun + TypeScript
  • ADR-0005 — Storage default is SQLite
  • ADR-0008 — Plugins are subprocesses over JSON-RPC on stdio
  • ADR-0013 — Observability substrate is Langfuse (operational logging is the fallback)
  • ADR-0023 — Project-plugin lifecycle hooks
  • ADR-0024 — Context compaction (the feature that triggered this ADR — the compaction error was invisible without structured logging)