ADR-0026: Cost management, model metadata overrides, and provider usage tracking

Status: Accepted
Date: 2026-05-30
Deciders: @karasu
Supersedes: —
Superseded by: —

Context

Kaged routes LLM calls through @kaged/llm (per ADR-0014) and loads model metadata from a bundled LiteLLM snapshot (per docs/specs/llm.md § Model metadata catalog). The snapshot provides context windows, per-token pricing, and capability flags for 1000+ models. But three gaps are now load-bearing:

No override mechanism. Operators cannot correct stale pricing, fix wrong context windows, or add metadata for self-hosted models (Ollama, vLLM, fine-tunes) that LiteLLM doesn't cover. The compaction system (per ADR-0024) uses maxInputTokens from this catalog to calculate thresholds — wrong data means wrong compaction timing.
No spend limits. Long-running agent sessions can accumulate cost without bound. Operators running kaged against paid providers (Anthropic, OpenAI, Google) need to cap spend per rolling window. Providers with rolling-window quotas (Antigravity, Z.AI) penalize operators who hit 100% — kaged should allow reserving a percentage of the window for other tools.
No usage visibility. @kaged/llm already has provider-specific usage fetchers (fetchAntigravityUsage, fetchZaiUsage, fetchFireworksUsage) returning normalized UsageReport data. But these are not wired into the daemon or the UI. Operators cannot see quota status, cost trajectory, or budget exhaustion until something breaks.

Three converging pressures:

Cost is real. Multi-hour sessions with recursive agents (per ADR-0022) and compaction summarizer calls (per ADR-0024) can cost $10-50+ per session. Operators need to see and control this.
Rolling-window providers penalize exhaustion. Antigravity rate-limits you for hitting 100% of your quota window. Operators want kaged to use, say, 50% of the window and leave the rest for other tools. This requires percentage-based limits, not just dollar caps.
Self-hosted models need metadata. Ollama and vLLM models have no LiteLLM entry. Operators need to set context windows and capabilities manually so compaction thresholds work correctly.

Decision

Kaged gains a model metadata override system (DB-stored, merged with LiteLLM defaults), per-provider spend limits with rolling-window awareness, an on-demand provider usage pipeline (fetch → cache → invalidate on LLM call), and session-level cost surfacing. The override system is the mechanism by which operators control context window sizes for compaction (extending ADR-0024) and correct model metadata without waiting for a LiteLLM snapshot update.

Specifics

1. Model metadata override system

Storage. A model_overrides table in SQLite (@kaged/storage): provider TEXT, model_id TEXT, field TEXT, value TEXT, updated_at INTEGER. Primary key on (provider, model_id, field). Sparse — only overridden fields have rows.
Merge semantics. @kaged/llm gains a resolveModelMeta(provider, modelId, overrides) function. The resolution order:
1. Operator override from DB (highest priority)
2. Bundled LiteLLM snapshot (lowest priority)
Fields overridable. All scalar fields on ModelMeta: maxInputTokens, maxOutputTokens, all pricing fields (input, output, reasoning, cacheRead, cacheWrite), all capability booleans, deprecationDate, tokenizer. Values are stored as JSON-encoded strings in the value column — numbers, booleans, strings, and null are all representable.
Models not in LiteLLM. When a model has no LiteLLM entry (common for self-hosted models), overrides are the only source. resolveModelMeta returns a ModelMeta built entirely from overrides, with missing fields defaulting to null (or false for capability booleans).
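The merge described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the @kaged/llm implementation: the ModelMeta shape is a reduced subset, supportsVision is a hypothetical capability flag, and the SNAPSHOT map with its example-cloud entry stands in for the bundled LiteLLM snapshot with made-up values.

```typescript
// Reduced, hypothetical subset of ModelMeta; the real type has more fields.
interface ModelMeta {
  maxInputTokens: number | null;
  maxOutputTokens: number | null;
  tokenizer: string | null;
  supportsVision: boolean; // hypothetical capability flag
}

// One row of the model_overrides table; value is a JSON-encoded string.
interface OverrideRow {
  provider: string;
  modelId: string;
  field: string;
  value: string;
}

// Stand-in for the bundled snapshot (illustrative values, not real data).
const SNAPSHOT = new Map<string, ModelMeta>([
  [
    "example-cloud/example-large",
    { maxInputTokens: 200000, maxOutputTokens: 8192, tokenizer: null, supportsVision: true },
  ],
]);

// Defaults used when a model has no snapshot entry: null, or false for booleans.
const DEFAULTS: ModelMeta = {
  maxInputTokens: null,
  maxOutputTokens: null,
  tokenizer: null,
  supportsVision: false,
};

function resolveModelMeta(
  provider: string,
  modelId: string,
  overrides: OverrideRow[],
): { meta: ModelMeta; overridden: string[] } {
  // Lowest priority: snapshot entry, or all-default metadata if none exists.
  const base = SNAPSHOT.get(`${provider}/${modelId}`) ?? DEFAULTS;
  const merged: { [key: string]: unknown } = { ...base };
  const overridden: string[] = [];
  for (const row of overrides) {
    if (row.provider !== provider || row.modelId !== modelId) continue;
    if (!(row.field in DEFAULTS)) continue; // ignore fields this sketch does not model
    // Highest priority: the operator override, decoded from its JSON string.
    merged[row.field] = JSON.parse(row.value);
    overridden.push(row.field);
  }
  return { meta: merged as unknown as ModelMeta, overridden };
}
```

Returning the list of overridden field names alongside the merged metadata is one way to support the "flagging which fields are overridden" behavior of the GET endpoint; the actual response shape is not specified here.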
Daemon endpoints. CRUD for overrides:
- GET /api/v1/local/providers/:name/models/:modelId/meta — returns merged metadata (defaults + overrides), flagging which fields are overridden.
- PUT /api/v1/local/providers/:name/models/:modelId/overrides — upsert one or more field overrides.
- DELETE /api/v1/local/providers/:name/models/:modelId/overrides — delete specific overrides (revert to defaults).
- DELETE /api/v1/local/providers/:name/models/:modelId/overrides/:field — delete one field.
No local.toml involvement. Overrides live in the DB, not in local.toml. Model metadata is operational data that changes independently of provider credentials and connection config. Mixing the two would make local.toml unmanageable.

2. Spend limits

Storage. A provider_spend_limits table: provider TEXT PRIMARY KEY, max_spend_5h_usd REAL, max_spend_7d_usd REAL, max_window_pct_5h REAL, max_window_pct_7d REAL, updated_at INTEGER.
Dollar-based limits. max_spend_5h_usd and max_spend_7d_usd cap kaged's cumulative spend per rolling window. A NULL value means the limit is not enforced.
Percentage-based limits. max_window_pct_5h and max_window_pct_7d cap kaged's share of a provider's rolling-window quota, expressed as a fraction from 0.0 to 1.0 (0.5 reserves half the window for other tools). For providers that expose usage as percentages (Antigravity), kaged compares its consumed fraction against this cap. For providers with dollar-based quotas, these fields are ignored.
Enforcement point. The daemon checks limits before each LLM call. If a limit would be exceeded:
- The call is rejected (hard block, not a warning). The operator gets a clear error: which limit, what the current spend is, when the window resets.
- If a fallback chain is configured (future; not v0), the daemon may try the next provider. If no fallback exists, the session pauses and the operator is notified.
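A rough sketch of the pre-call gate, assuming the limits and current usage have already been loaded into memory. The type names, the CurrentUsage shape, and the choice to block once current usage reaches the cap (rather than estimating the next call's cost) are assumptions made for illustration; the limit names mirror the provider_spend_limits columns.

```typescript
// Hypothetical in-memory form of one provider_spend_limits row.
interface SpendLimits {
  maxSpend5hUsd: number | null;
  maxSpend7dUsd: number | null;
  maxWindowPct5h: number | null;
  maxWindowPct7d: number | null;
}

// Hypothetical snapshot of current usage for one provider.
interface CurrentUsage {
  spend5hUsd: number;
  spend7dUsd: number;
  windowPct5h: number | null; // fraction 0.0 to 1.0; null if the provider reports no percentage
  windowPct7d: number | null;
  resetsAt5h: number; // epoch ms
  resetsAt7d: number; // epoch ms
}

// A rejected call carries what the operator needs: which limit, current value, reset time.
type LimitCheck =
  | { ok: true }
  | { ok: false; limit: string; current: number; max: number; resetsAt: number };

function checkLimits(limits: SpendLimits, usage: CurrentUsage): LimitCheck {
  const checks: Array<[string, number | null, number | null, number]> = [
    ["max_spend_5h_usd", limits.maxSpend5hUsd, usage.spend5hUsd, usage.resetsAt5h],
    ["max_spend_7d_usd", limits.maxSpend7dUsd, usage.spend7dUsd, usage.resetsAt7d],
    ["max_window_pct_5h", limits.maxWindowPct5h, usage.windowPct5h, usage.resetsAt5h],
    ["max_window_pct_7d", limits.maxWindowPct7d, usage.windowPct7d, usage.resetsAt7d],
  ];
  for (const [limit, max, current, resetsAt] of checks) {
    // A NULL limit is not enforced; a missing percentage cannot be compared.
    if (max === null || current === null) continue;
    // Assumption: block once usage has reached the cap.
    if (current >= max) return { ok: false, limit, current, max, resetsAt };
  }
  return { ok: true };
}
```

The first violated limit is reported; a real implementation might prefer to report all of them.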
Spend accumulation. The daemon tracks cumulative spend per provider in a provider_spend_events table: id TEXT, provider TEXT, model_id TEXT, session_id TEXT, cost_usd REAL, window_5h_key TEXT, window_7d_key TEXT, created_at INTEGER. The window_*_key columns are fixed-size bucket indices derived from created_at, which is in epoch milliseconds (e.g. window_5h_key = floor(created_at / (5 * 3600 * 1000))). A rolling window ending now overlaps at most two adjacent buckets, so the keys let a range query narrow to those buckets before filtering on created_at.
Spend query. Before each call, the daemon sums cost_usd from provider_spend_events for the current rolling window and compares against the limit.
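The window key and rolling-sum arithmetic can be illustrated with a small in-memory sketch. It assumes created_at is in epoch milliseconds and uses an array in place of the provider_spend_events table; the function names and the SpendEvent shape are hypothetical, and a real implementation would express the same filter as a single SQL query.

```typescript
const FIVE_HOURS_MS = 5 * 3600 * 1000;

// Hypothetical in-memory form of a provider_spend_events row (subset of columns).
interface SpendEvent {
  provider: string;
  costUsd: number;
  createdAt: number; // epoch ms
  window5hKey: string;
}

// Bucket index for a timestamp, stored as text like the window_5h_key column.
function windowKey(createdAt: number, windowMs: number): string {
  return String(Math.floor(createdAt / windowMs));
}

function makeEvent(provider: string, costUsd: number, createdAt: number): SpendEvent {
  return { provider, costUsd, createdAt, window5hKey: windowKey(createdAt, FIVE_HOURS_MS) };
}

// Sum spend in the rolling window (now - 5h, now] for one provider.
function rollingSpend5h(events: SpendEvent[], provider: string, now: number): number {
  const start = now - FIVE_HOURS_MS;
  // The window touches at most two adjacent buckets: the one holding start
  // and the one holding now. Narrow by key first, then filter exactly.
  const candidateKeys = new Set([
    windowKey(start, FIVE_HOURS_MS),
    windowKey(now, FIVE_HOURS_MS),
  ]);
  let total = 0;
  for (const e of events) {
    if (e.provider !== provider) continue;
    if (!candidateKeys.has(e.window5hKey)) continue;
    if (e.createdAt > start && e.createdAt <= now) total += e.costUsd;
  }
  return total;
}
```

The key check alone would over-count, because the older bucket also contains events that have already aged out of the window; the created_at comparison is what makes the sum a true rolling window.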

3. Provider usage pipeline

On-demand fetch. The daemon fetches provider usage when the UI requests it via GET /api/v1/local/providers/:name/usage. Not polled on a schedule.
Cache. The fetched UsageReport is cached in the DB (provider_usage_cache table: provider TEXT PRIMARY KEY, report_json TEXT, fetched_at INTEGER). The UI response includes fetched_at so the operator sees data freshness.
Cache invalidation. After every LLM call to a provider, the daemon invalidates that provider's usage cache. The call changed usage; the cached report is stale.
Manual refresh. POST /api/v1/local/providers/:name/usage/refresh forces a fresh fetch regardless of cache state. This covers the rare case where another tool used the provider outside kaged.
Fetcher dispatch. The daemon maintains a provider → fetcher mapping. Each provider that supports usage reporting gets its fetcher from @kaged/llm wired to the appropriate credential source. Providers without fetchers return { ok: false, error: "no_usage_endpoint" }.
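The fetch, cache, and invalidate flow above might look roughly like this. It is a sketch only: the UsagePipeline class, the in-memory Map standing in for the provider_usage_cache table, and the usedFraction field on UsageReport are all assumptions, and the fetchers are passed in as plain functions rather than the real @kaged/llm fetchers, whose signatures are not shown in this document.

```typescript
// Hypothetical minimal UsageReport; the real normalized shape lives in @kaged/llm.
interface UsageReport {
  usedFraction: number;
}

type UsageResult =
  | { ok: true; report: UsageReport; fetchedAt: number }
  | { ok: false; error: string };

type UsageFetcher = () => Promise<UsageReport>;

class UsagePipeline {
  // In-memory stand-in for the provider_usage_cache table.
  private cache = new Map<string, { report: UsageReport; fetchedAt: number }>();
  private fetchers: Map<string, UsageFetcher>;
  private now: () => number;

  constructor(fetchers: Map<string, UsageFetcher>, now: () => number = Date.now) {
    this.fetchers = fetchers;
    this.now = now;
  }

  // On-demand fetch: serve the cached report unless it is missing or a refresh is forced.
  async getUsage(provider: string, forceRefresh = false): Promise<UsageResult> {
    const fetcher = this.fetchers.get(provider);
    if (!fetcher) return { ok: false, error: "no_usage_endpoint" };
    const cached = this.cache.get(provider);
    if (cached && !forceRefresh) {
      return { ok: true, report: cached.report, fetchedAt: cached.fetchedAt };
    }
    const report = await fetcher();
    const fetchedAt = this.now();
    this.cache.set(provider, { report, fetchedAt });
    return { ok: true, report, fetchedAt };
  }

  // Called after every LLM call to the provider: the cached report is now stale.
  invalidate(provider: string): void {
    this.cache.delete(provider);
  }
}
```

In this sketch the GET usage route would call getUsage(name), the POST refresh route would call getUsage(name, true), and the dispatch path would call invalidate(name) after each completed LLM call.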

4. Session cost surfacing

Per-message cost. Already tracked — the messages.cost_total column and the stats.cost field on message.end events.
Cumulative session cost. The daemon computes session-level cost by summing cost_total from all non-superseded messages in the session. This is returned in the session detail API response and relayed to the UI after each message.
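As a small illustration of the summation, assuming message rows have been loaded with their cost_total and a boolean marking whether they were superseded. The MessageRow shape and the superseded flag are hypothetical; how supersession is actually recorded is not specified here.

```typescript
// Hypothetical in-memory form of a messages row (subset of columns).
interface MessageRow {
  id: string;
  costTotal: number | null; // null when no cost was recorded
  superseded: boolean; // assumption: supersession is available as a flag
}

// Cumulative session cost: sum cost_total over non-superseded messages.
function sessionCost(messages: MessageRow[]): number {
  return messages.reduce(
    (sum, m) => (m.superseded ? sum : sum + (m.costTotal ?? 0)),
    0,
  );
}
```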
Provider usage in session view. When loading a session or after each message, the UI shows:
- Cumulative session cost ($).
- Provider usage status: % of window used, spend vs limit, time until reset (from cached UsageReport).
- Visual indicator (progress bar or badge) when approaching limits.
Model selection awareness. The session view's model picker (if present) shows per-model pricing alongside the current provider's budget status, so operators can choose cheaper models when budget is tight.

5. Context size overrides (ADR-0024 extension)

The compaction system (ADR-0024) calculates thresholds against the model's context window. That context window comes from ModelMeta.maxInputTokens.
The override system above allows operators to set maxInputTokens and maxOutputTokens per provider+model.
When compaction runs, it uses the effective context window (default + override). This gives operators direct control over compaction timing without touching DSL compaction thresholds.
No new ADR needed for this — it's a natural consequence of the override system. ADR-0024 gains an amendment noting that maxInputTokens is overridable.
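To make the effect on compaction timing concrete, here is a hedged sketch of how an override shifts the trigger point. The function name and the thresholdFraction parameter are hypothetical; ADR-0024 defines the real threshold configuration, which is not reproduced here.

```typescript
// Effective context window: the operator override wins over the snapshot value.
// Returns the token count at which compaction would trigger, or null when no
// context window is known from either source.
function compactionTriggerTokens(
  snapshotMaxInput: number | null,
  overrideMaxInput: number | null,
  thresholdFraction: number, // hypothetical, e.g. 0.75 of the context window
): number | null {
  const effective = overrideMaxInput ?? snapshotMaxInput;
  if (effective === null) return null;
  return Math.floor(effective * thresholdFraction);
}
```

With a snapshot window of 200000 tokens and a 0.75 threshold, compaction would trigger at 150000 tokens; overriding maxInputTokens to 32768 moves the trigger to 24576 without changing the threshold itself.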

Consequences

What this commits us to

A model_overrides table in @kaged/storage — sparse key-value per field, merged at read time.
A provider_spend_limits table in @kaged/storage — per-provider spend configuration.
A provider_spend_events table in @kaged/storage — append-only cost ledger per LLM call.
A provider_usage_cache table in @kaged/storage — cached UsageReport JSON per provider.
A resolveModelMeta(provider, modelId, overrides) function in @kaged/llm — the merge path.
Four new daemon API endpoints for model override CRUD.
Three new daemon API endpoints for provider usage and spend limits: fetch cached usage, force refresh, and spend-limit CRUD.
Spend enforcement in the daemon's LLM dispatch path — before each call, check limits, reject if exceeded.
Cache invalidation logic — daemon invalidates provider usage cache after each LLM call.
UI enhancement: provider settings model metadata table with toggleable columns, inline editing, default/override visual distinction.
UI enhancement: session view cost panel with cumulative cost, provider usage status, limit proximity indicators.
Spec amendments to docs/specs/llm.md, docs/specs/http-api.md, docs/specs/ui/README.md.
ADR-0024 amendment for context size overrides.
STATUS.md update.

What this forecloses

local.toml as the override location. Model metadata overrides are operational data, not connection config. They belong in the DB where they can be managed per-model without editing config files.
Soft warnings for spend limits. Exceeding a spend limit is a hard block. The operator set the limit because they wanted it enforced. Soft warnings are easily dismissed and defeat the purpose. The error message is clear; the operator can raise the limit or wait for the window to reset.
Per-model spend limits. v0 scopes limits per-provider. Per-model limits add combinatorial complexity (a session may use multiple models in one provider). If needed, a future amendment can add them. Starting per-provider keeps the schema and enforcement logic simple.

What becomes easier

Self-hosted model management. Operators can add context windows and pricing for Ollama/vLLM models directly, making compaction and cost tracking work for every model.
Correcting stale LiteLLM data. When a provider changes pricing or context windows, operators can override immediately without waiting for a kaged release with an updated snapshot.
Cost control. Operators can run long sessions without surprise bills. The spend limits are the safety net.
Provider budget awareness. The session view shows operators exactly where they stand before they commit to another expensive call.
Multi-tool budget sharing. Percentage-based limits let operators reserve quota for other tools running against the same provider.

What becomes harder

Storage surface. Four new tables. Each needs a migration, mapper functions, and test coverage.
Daemon dispatch path complexity. Every LLM call now goes through a spend-limit check before reaching the harness. This is a synchronous gate; it must be fast (single DB query).
UI complexity. The model metadata table with toggleable columns, inline editing, and default/override visual distinction is a substantial UI component. The session cost panel adds another.
Merge semantics testing. The resolveModelMeta merge path has many edge cases: override exists but LiteLLM doesn't, LiteLLM exists but no overrides, both exist with partial overlaps, null handling. All must be tested.

Alternatives considered

Alternative A — Overrides in local.toml

Store overrides alongside provider credentials in local.toml.

Why tempting: No new DB table. Overrides live next to the provider config they modify. Simple mental model.

Why rejected: local.toml is for connection config (credentials, base URLs, model lists). Model metadata overrides are operational data — they change frequently, they're per-model (potentially dozens per provider), and they have different access patterns (CRUD from UI, not file edits). Mixing them into local.toml makes the file unmanageable and requires config file writes from the daemon, which is fragile. The DB is the right place for operational data.

Alternative B — Wide override row per model

One row per (provider, model_id) with columns for every overridable field.

Why tempting: Fewer rows. Natural to think "one model = one row." Easier joins.

Why rejected: ModelMeta has 20+ overridable fields. Most overrides touch 2-3 fields (context window, pricing). A wide row means 17+ NULL columns per override. Adding new fields requires a migration. The sparse key-value approach means adding a new overridable field requires zero schema changes — just a new field value. The merge logic is the same either way; the storage difference is negligible for the volumes involved (operators override a handful of models, not thousands).

Alternative C — Soft warnings for spend limits

When a spend limit is approached or exceeded, show a warning but allow the call to proceed.

Why tempting: Operators hate blocked workflows. A warning is less disruptive.

Why rejected: The operator set the limit because they wanted it enforced. If the limit is advisory, operators will learn to ignore it (alarm fatigue). The hard block forces a conscious decision: raise the limit, wait for the window to reset, or switch to a cheaper model. The error message provides all the information needed to make that decision. This is the same philosophy as compaction thresholds — the operator configures the boundary, kaged enforces it.

Alternative D — Per-model spend limits

Spend limits scoped to individual models within a provider.

Why tempting: Fine-grained control. An operator might want to cap Claude Opus at $5 per 5-hour window but leave Claude Sonnet uncapped.

Why rejected: Adds combinatorial complexity to the enforcement path. A single session may use multiple models (compaction summarizer uses a different model than the primary agent). The daemon must check limits per-call, not per-session. Per-model limits mean the enforcement query must match on (provider, model_id) and the UI must manage N×M limit configurations. Starting per-provider keeps the schema and enforcement simple. A future amendment can add per-model granularity if needed.

Alternative E — Scheduled usage polling

Fetch provider usage on a fixed interval (every N minutes) instead of on-demand.

Why tempting: The UI always has fresh data. No explicit "refresh" button needed.

Why rejected: Most provider usage endpoints have their own rate limits. Polling every 5 minutes for a provider the operator hasn't used in hours wastes quota. On-demand + cache-with-invalidation-on-LLM-call means the data is always fresh when it matters (the operator just made a call) and not fetched when it doesn't (idle time). Manual refresh covers the rare "something else used my provider" case.

Alternative F — Override only pricing, not capabilities

Allow operators to override pricing but not capability flags.

Why tempting: Capabilities are binary and usually correct in LiteLLM. Pricing changes more often.

Why rejected: Self-hosted models (Ollama, vLLM) have no LiteLLM entry at all — both pricing and capabilities are missing. The compaction system depends on maxInputTokens (a capability-adjacent field). The override system must be general enough to cover the self-hosted case, which means supporting all fields. Restricting it to pricing would require a parallel mechanism for capabilities — more complexity, not less.

References

ADR-0014: All LLM providers route through @kaged/llm
ADR-0024: Context compaction is kaged-owned, layered, observable, and operator-tunable
ADR-0005: Storage default is SQLite
ADR-0003: Doc-first, then TDD
Spec: LLM Provider Interface — model metadata catalog, usage fetchers
Spec: HTTP API — daemon endpoints
Spec: Storage (inline in @kaged/storage) — existing tables