ADR-0026: Cost management, model metadata overrides, and provider usage tracking
- Status: Proposed
- Date: 2026-05-30
- Deciders: @karasu
- Supersedes: —
- Superseded by: —
Context
Kaged routes LLM calls through @kaged/llm (per ADR-0014) and loads model metadata from a bundled LiteLLM snapshot (per docs/specs/llm.md § Model metadata catalog). The snapshot provides context windows, per-token pricing, and capability flags for 1000+ models. But three gaps are now load-bearing:
- No override mechanism. Operators cannot correct stale pricing, fix wrong context windows, or add metadata for self-hosted models (Ollama, vLLM, fine-tunes) that LiteLLM doesn't cover. The compaction system (per ADR-0024) uses maxInputTokens from this catalog to calculate thresholds, so wrong data means wrong compaction timing.
- No spend limits. Long-running agent sessions can accumulate cost without bound. Operators running kaged against paid providers (Anthropic, OpenAI, Google) need to cap spend per rolling window. Providers with rolling-window quotas (Antigravity, Z.AI) penalize operators who hit 100%, so kaged should allow reserving a percentage of the window for other tools.
- No usage visibility. @kaged/llm already has provider-specific usage fetchers (fetchAntigravityUsage, fetchZaiUsage, fetchFireworksUsage) returning normalized UsageReport data. But these are not wired into the daemon or the UI. Operators cannot see quota status, cost trajectory, or budget exhaustion until something breaks.
Three converging pressures:
- Cost is real. Multi-hour sessions with recursive agents (per ADR-0022) and compaction summarizer calls (per ADR-0024) can cost $10-50+ per session. Operators need to see and control this.
- Rolling-window providers penalize exhaustion. Antigravity rate-limits you for hitting 100% of your quota window. Operators want kaged to use, say, 50% of the window and leave the rest for other tools. This requires percentage-based limits, not just dollar caps.
- Self-hosted models need metadata. Ollama and vLLM models have no LiteLLM entry. Operators need to set context windows and capabilities manually so compaction thresholds work correctly.
Decision
Kaged gains a model metadata override system (DB-stored, merged with LiteLLM defaults), per-provider spend limits with rolling-window awareness, an on-demand provider usage pipeline (fetch → cache → invalidate on LLM call), and session-level cost surfacing. The override system is the mechanism by which operators control context window sizes for compaction (extending ADR-0024) and correct model metadata without waiting for a LiteLLM snapshot update.
Specifics
1. Model metadata override system
- Storage. A model_overrides table in SQLite (@kaged/storage): provider TEXT, model_id TEXT, field TEXT, value TEXT, updated_at INTEGER. Primary key on (provider, model_id, field). Sparse: only overridden fields have rows.
- Merge semantics. @kaged/llm gains a resolveModelMeta(provider, modelId, overrides) function. The resolution order:
  - Operator override from DB (highest priority)
  - Bundled LiteLLM snapshot (lowest priority)
- Fields overridable. All scalar fields on ModelMeta: maxInputTokens, maxOutputTokens, all pricing fields (input, output, reasoning, cacheRead, cacheWrite), all capability booleans, deprecationDate, tokenizer. Values are stored as JSON-encoded strings in the value column, so numbers, booleans, strings, and null are all representable.
- Models not in LiteLLM. When a model has no LiteLLM entry (common for self-hosted), overrides are the only source. resolveModelMeta returns a ModelMeta built entirely from overrides, with missing fields defaulting to null/false.
- Daemon endpoints. CRUD for overrides:
  - GET /api/v1/local/providers/:name/models/:modelId/meta: returns merged metadata (defaults + overrides), flagging which fields are overridden.
  - PUT /api/v1/local/providers/:name/models/:modelId/overrides: upsert one or more field overrides.
  - DELETE /api/v1/local/providers/:name/models/:modelId/overrides: delete specific overrides (revert to defaults).
  - DELETE /api/v1/local/providers/:name/models/:modelId/overrides/:field: delete one field.
- No local.toml involvement. Overrides live in the DB, not in local.toml. Model metadata is operational data that changes independently of provider credentials and connection config. Mixing the two would make local.toml unmanageable.
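The merge order above can be sketched roughly as follows. This is a minimal illustration, not the actual @kaged/llm code: the trimmed ModelMeta shape, the OverrideRow type, the EMPTY_META defaults, and the in-memory SNAPSHOT map (with a made-up entry) are all assumptions standing in for the real types and the bundled LiteLLM snapshot. Only the function name and its three parameters come from this ADR.

```typescript
// Hypothetical, trimmed-down ModelMeta; the real type has more fields.
interface ModelMeta {
  maxInputTokens: number | null;
  maxOutputTokens: number | null;
  inputCostPerToken: number | null;
  supportsVision: boolean;
}

// One row of model_overrides; `value` holds a JSON-encoded string.
interface OverrideRow {
  provider: string;
  modelId: string;
  field: string;
  value: string;
}

// Fallback for models with no snapshot entry: null scalars, false flags.
const EMPTY_META: ModelMeta = {
  maxInputTokens: null,
  maxOutputTokens: null,
  inputCostPerToken: null,
  supportsVision: false,
};

// Stand-in for the bundled LiteLLM snapshot (made-up entry and numbers).
const SNAPSHOT = new Map<string, ModelMeta>([
  ["example-provider/example-model", {
    maxInputTokens: 200000,
    maxOutputTokens: 8192,
    inputCostPerToken: 0.000003,
    supportsVision: true,
  }],
]);

function resolveModelMeta(
  provider: string,
  modelId: string,
  overrides: OverrideRow[],
): { meta: ModelMeta; overriddenFields: string[] } {
  // Lowest priority: the snapshot entry, or empty defaults if there is none.
  const base = SNAPSHOT.get(`${provider}/${modelId}`) ?? EMPTY_META;
  const merged: Record<string, unknown> = { ...base };
  const overriddenFields: string[] = [];
  for (const row of overrides) {
    if (row.provider !== provider || row.modelId !== modelId) continue;
    if (!(row.field in EMPTY_META)) continue; // skip unknown field names
    merged[row.field] = JSON.parse(row.value); // highest priority: override
    overriddenFields.push(row.field);
  }
  return { meta: merged as unknown as ModelMeta, overriddenFields };
}
```

The sketch does not validate that a decoded value has the type the field expects; a real implementation would likely want that check before trusting operator input. Returning overriddenFields alongside the merged metadata is one way the GET .../meta endpoint could flag which fields are overridden.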
2. Spend limits
- Storage. A provider_spend_limits table: provider TEXT PRIMARY KEY, max_spend_5h_usd REAL, max_spend_7d_usd REAL, max_window_pct_5h REAL, max_window_pct_7d REAL, updated_at INTEGER.
- Dollar-based limits. max_spend_5h_usd and max_spend_7d_usd cap kaged's cumulative spend per rolling window. A limit that is NULL is not enforced.
- Percentage-based limits. max_window_pct_5h and max_window_pct_7d cap kaged's share of a provider's rolling-window quota. Expressed as a fraction (0.0–1.0). For providers that expose usage as percentages (Antigravity), kaged compares its consumed fraction against this cap. For providers with dollar-based quotas, this field is irrelevant.
- Enforcement point. The daemon checks limits before each LLM call. If a limit would be exceeded:
  - The call is rejected (hard block, not a warning). The operator gets a clear error: which limit, what the current spend is, when the window resets.
  - If a fallback chain is configured (future; not v0), the daemon may try the next provider. If no fallback exists, the session pauses and the operator is notified.
- Spend accumulation. The daemon tracks cumulative spend per provider in a provider_spend_events table: id TEXT, provider TEXT, model_id TEXT, session_id TEXT, cost_usd REAL, window_5h_key TEXT, window_7d_key TEXT, created_at INTEGER. The window_*_key columns are derived from created_at (e.g. window_5h_key = floor(created_at / (5 * 3600 * 1000))) to enable efficient range queries.
- Spend query. Before each call, the daemon sums cost_usd from provider_spend_events for the current rolling window and compares against the limit.
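The pre-call check described above might look something like the sketch below. It works on in-memory arrays rather than SQLite, and the SpendEvent, SpendLimits, and LimitCheck shapes are hypothetical. Two assumptions are worth calling out: created_at is treated as epoch milliseconds (which the 3600 * 1000 factor suggests), and a limit counts as hit once accumulated spend reaches the cap, since the cost of the pending call is not known in advance.

```typescript
// Hypothetical in-memory shapes mirroring provider_spend_events and
// provider_spend_limits; timestamps are assumed to be epoch milliseconds.
interface SpendEvent { provider: string; costUsd: number; createdAt: number; }
interface SpendLimits { maxSpend5hUsd: number | null; maxSpend7dUsd: number | null; }

const HOUR_MS = 3600 * 1000;
const WINDOW_5H_MS = 5 * HOUR_MS;
const WINDOW_7D_MS = 7 * 24 * HOUR_MS;

// Bucket key as in the formula above: floor(created_at / window length).
function windowKey(createdAt: number, windowMs: number): number {
  return Math.floor(createdAt / windowMs);
}

// Sum cost for one provider over the trailing window ending at `now`.
function spendInWindow(
  events: SpendEvent[],
  provider: string,
  now: number,
  windowMs: number,
): number {
  const cutoff = now - windowMs;
  let total = 0;
  for (const e of events) {
    if (e.provider === provider && e.createdAt > cutoff && e.createdAt <= now) {
      total += e.costUsd;
    }
  }
  return total;
}

type LimitCheck =
  | { ok: true }
  | { ok: false; limit: "5h" | "7d"; spentUsd: number; maxUsd: number };

function checkSpendLimits(
  events: SpendEvent[],
  provider: string,
  limits: SpendLimits,
  now: number,
): LimitCheck {
  const windows = [
    { limit: "5h" as const, windowMs: WINDOW_5H_MS, maxUsd: limits.maxSpend5hUsd },
    { limit: "7d" as const, windowMs: WINDOW_7D_MS, maxUsd: limits.maxSpend7dUsd },
  ];
  for (const w of windows) {
    if (w.maxUsd === null) continue; // NULL limit: not enforced
    const spentUsd = spendInWindow(events, provider, now, w.windowMs);
    if (spentUsd >= w.maxUsd) {
      return { ok: false, limit: w.limit, spentUsd, maxUsd: w.maxUsd };
    }
  }
  return { ok: true };
}
```

Note that windowKey yields fixed buckets, not a trailing window: a trailing 5-hour window generally straddles two adjacent buckets. One plausible use of the key columns is to narrow the scan to those two buckets and then apply the exact created_at cutoff, as spendInWindow does here.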
3. Provider usage pipeline
- On-demand fetch. The daemon fetches provider usage when the UI requests it via GET /api/v1/local/providers/:name/usage. Not polled on a schedule.
- Cache. The fetched UsageReport is cached in the DB (provider_usage_cache table: provider TEXT PRIMARY KEY, report_json TEXT, fetched_at INTEGER). The UI response includes fetched_at so the operator sees data freshness.
- Cache invalidation. After every LLM call to a provider, the daemon invalidates that provider's usage cache. The call changed usage; the cached report is stale.
- Manual refresh. POST /api/v1/local/providers/:name/usage/refresh forces a fresh fetch regardless of cache state. For the rare case where something else used the provider outside kaged.
- Fetcher dispatch. The daemon maintains a provider → fetcher mapping. Each provider that supports usage reporting gets its fetcher from @kaged/llm wired to the appropriate credential source. Providers without fetchers return { ok: false, error: "no_usage_endpoint" }.
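The fetch, cache, and invalidate cycle above could be sketched as a small class. This is illustrative only: ProviderUsageCache, UsageFetcher, and UsageResponse are hypothetical names, UsageReport is reduced to an opaque record, and an in-memory Map stands in for the provider_usage_cache table. The "no_usage_endpoint" error value is the one named in this ADR.

```typescript
// Hypothetical shapes; the real UsageReport type lives in @kaged/llm.
type UsageReport = Record<string, unknown>;
type UsageFetcher = () => Promise<UsageReport>;
type UsageResponse =
  | { ok: true; report: UsageReport; fetchedAt: number }
  | { ok: false; error: string };

class ProviderUsageCache {
  private readonly fetchers: Map<string, UsageFetcher>;
  // In-memory stand-in for the provider_usage_cache table.
  private readonly entries = new Map<string, { report: UsageReport; fetchedAt: number }>();

  constructor(fetchers: Map<string, UsageFetcher>) {
    this.fetchers = fetchers;
  }

  // GET .../usage: serve the cached report, fetching only on a cache miss.
  // POST .../usage/refresh would call this with forceRefresh = true.
  async get(provider: string, forceRefresh = false): Promise<UsageResponse> {
    const fetcher = this.fetchers.get(provider);
    if (!fetcher) return { ok: false, error: "no_usage_endpoint" };
    const cached = this.entries.get(provider);
    if (cached && !forceRefresh) return { ok: true, ...cached };
    const entry = { report: await fetcher(), fetchedAt: Date.now() };
    this.entries.set(provider, entry);
    return { ok: true, ...entry };
  }

  // Called after every LLM call to `provider`: the cached report is now stale.
  invalidate(provider: string): void {
    this.entries.delete(provider);
  }
}
```

Because invalidation only deletes the entry, no provider request is made until the UI next asks for usage, which keeps the pipeline on-demand.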
4. Session cost surfacing
- Per-message cost. Already tracked: the messages.cost_total column and the stats.cost field on message.end events.
- Cumulative session cost. The daemon computes session-level cost by summing cost_total from all non-superseded messages in the session. This is returned in the session detail API response and relayed to the UI after each message.
- Provider usage in session view. When loading a session or after each message, the UI shows:
  - Cumulative session cost ($).
  - Provider usage status: % of window used, spend vs limit, time until reset (from cached UsageReport).
  - Visual indicator (progress bar or badge) when approaching limits.
- Model selection awareness. The session view's model picker (if present) shows per-model pricing alongside the current provider's budget status, so operators can choose cheaper models when budget is tight.
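As a small illustration of the cumulative session cost rule, the sum over non-superseded messages could be written as below. The MessageRow shape is an assumption: costTotal mirrors the messages.cost_total column, while the superseded boolean is a placeholder for however superseded messages are actually marked in storage.

```typescript
// Hypothetical row shape: costTotal mirrors messages.cost_total; the
// `superseded` flag stands in for however superseded messages are marked.
interface MessageRow {
  sessionId: string;
  costTotal: number | null;
  superseded: boolean;
}

function sessionCostUsd(messages: MessageRow[], sessionId: string): number {
  let total = 0;
  for (const m of messages) {
    if (m.sessionId !== sessionId || m.superseded) continue;
    total += m.costTotal ?? 0; // messages without a recorded cost add nothing
  }
  return total;
}
```

In the daemon this would more likely be a single SQL SUM over the messages table; the loop form is used here only to make the filtering rule explicit.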
5. Context size overrides (ADR-0024 extension)
- The compaction system (ADR-0024) calculates thresholds against the model's context window. That context window comes from ModelMeta.maxInputTokens.
- The override system above allows operators to set maxInputTokens and maxOutputTokens per provider+model.
- When compaction runs, it uses the effective context window (default + override). This gives operators direct control over compaction timing without touching DSL compaction thresholds.
- No new ADR needed for this; it's a natural consequence of the override system. ADR-0024 gains an amendment noting that maxInputTokens is overridable.
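To make the effect on compaction timing concrete, here is a rough sketch. It assumes compaction triggers at a fixed fraction of the effective context window; that rule and the compactionTriggerTokens helper are placeholders invented for illustration, and ADR-0024 remains the authority on how thresholds are really computed.

```typescript
// Assumption: compaction triggers at a fixed fraction of the context window.
// ADR-0024 defines the real thresholds; this helper is only a placeholder.
function compactionTriggerTokens(
  snapshotMaxInputTokens: number | null,
  overrideMaxInputTokens: number | null,
  triggerFraction: number,
): number | null {
  // The override wins when present; otherwise fall back to the snapshot.
  const effective = overrideMaxInputTokens ?? snapshotMaxInputTokens;
  if (effective === null) return null; // no known context window
  return Math.floor(effective * triggerFraction);
}
```

Under this assumption, lowering maxInputTokens through an override from 200000 to 32768 with a 0.75 fraction moves the trigger point from 150000 tokens to 24576 tokens without changing any DSL threshold.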
Consequences
What this commits us to
- A model_overrides table in @kaged/storage: sparse key-value per field, merged at read time.
- A provider_spend_limits table in @kaged/storage: per-provider spend configuration.
- A provider_spend_events table in @kaged/storage: append-only cost ledger per LLM call.
- A provider_usage_cache table in @kaged/storage: cached UsageReport JSON per provider.
- A resolveModelMeta(provider, modelId, overrides) function in @kaged/llm: the merge path.
- Four new daemon API endpoints for model override CRUD.
- Three new daemon API endpoints for provider usage (fetch cached, force refresh, and the spend limits CRUD).
- Spend enforcement in the daemon's LLM dispatch path: before each call, check limits, reject if exceeded.
- Cache invalidation logic: the daemon invalidates the provider usage cache after each LLM call.
- UI enhancement: provider settings model metadata table with toggleable columns, inline editing, default/override visual distinction.
- UI enhancement: session view cost panel with cumulative cost, provider usage status, limit proximity indicators.
- Spec amendments to docs/specs/llm.md, docs/specs/http-api.md, docs/specs/ui/README.md.
- ADR-0024 amendment for context size overrides.
- STATUS.md update.
What this forecloses
- Local.toml as the override location. Model metadata overrides are operational data, not connection config. They belong in the DB where they can be managed per-model without editing config files.
- Soft warnings for spend limits. Exceeding a spend limit is a hard block. The operator set the limit because they wanted it enforced. Soft warnings are easily dismissed and defeat the purpose. The error message is clear; the operator can raise the limit or wait for the window to reset.
- Per-model spend limits. v0 scopes limits per-provider. Per-model limits add combinatorial complexity (a session may use multiple models in one provider). If needed, a future amendment can add them. Starting per-provider keeps the schema and enforcement logic simple.
What becomes easier
- Self-hosted model management. Operators can add context windows and pricing for Ollama/vLLM models directly, making compaction and cost tracking work for every model.
- Correcting stale LiteLLM data. When a provider changes pricing or context windows, operators can override immediately without waiting for a kaged release with an updated snapshot.
- Cost control. Operators can run long sessions without surprise bills. The spend limits are the safety net.
- Provider budget awareness. The session view shows operators exactly where they stand before they commit to another expensive call.
- Multi-tool budget sharing. Percentage-based limits let operators reserve quota for other tools running against the same provider.
What becomes harder
- Storage surface. Four new tables. Each needs a migration, mapper functions, and test coverage.
- Daemon dispatch path complexity. Every LLM call now goes through a spend-limit check before reaching the harness. This is a synchronous gate; it must be fast (single DB query).
- UI complexity. The model metadata table with toggleable columns, inline editing, and default/override visual distinction is a substantial UI component. The session cost panel adds another.
- Merge semantics testing. The resolveModelMeta merge path has many edge cases: override exists but LiteLLM doesn't, LiteLLM exists but no overrides, both exist with partial overlaps, null handling. All must be tested.
Alternatives considered
Alternative A — Overrides in local.toml
Store overrides alongside provider credentials in local.toml.
Why tempting: No new DB table. Overrides live next to the provider config they modify. Simple mental model.
Why rejected: local.toml is for connection config (credentials, base URLs, model lists). Model metadata overrides are operational data — they change frequently, they're per-model (potentially dozens per provider), and they have different access patterns (CRUD from UI, not file edits). Mixing them into local.toml makes the file unmanageable and requires config file writes from the daemon, which is fragile. The DB is the right place for operational data.
Alternative B — Wide override row per model
One row per (provider, model_id) with columns for every overridable field.
Why tempting: Fewer rows. Natural to think "one model = one row." Easier joins.
Why rejected: ModelMeta has 20+ overridable fields. Most overrides touch 2-3 fields (context window, pricing). A wide row means 17+ NULL columns per override. Adding new fields requires a migration. The sparse key-value approach means adding a new overridable field requires zero schema changes — just a new field value. The merge logic is the same either way; the storage difference is negligible for the volumes involved (operators override a handful of models, not thousands).
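The sparse layout can be illustrated by how a PUT payload might be turned into rows. This is a sketch under assumptions: toOverrideRows, OverrideValue, and the camelCase OverrideRow shape are hypothetical, and the actual upsert into model_overrides is left out.

```typescript
// Values an override may carry, per the "JSON-encoded strings" rule.
type OverrideValue = number | boolean | string | null;

// Hypothetical in-memory form of one model_overrides row.
interface OverrideRow {
  provider: string;
  modelId: string;
  field: string;
  value: string; // JSON-encoded
  updatedAt: number;
}

// Turn a partial patch into one row per overridden field. A field that is
// new to ModelMeta needs no schema change: it is just another `field` value.
function toOverrideRows(
  provider: string,
  modelId: string,
  patch: Record<string, OverrideValue>,
  now: number,
): OverrideRow[] {
  return Object.entries(patch).map(([field, value]) => ({
    provider,
    modelId,
    field,
    value: JSON.stringify(value),
    updatedAt: now,
  }));
}
```

A patch touching two fields produces two rows, whereas the wide-row design would write a full row with every other column NULL.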
Alternative C — Soft warnings for spend limits
When a spend limit is approached or exceeded, show a warning but allow the call to proceed.
Why tempting: Operators hate blocked workflows. A warning is less disruptive.
Why rejected: The operator set the limit because they wanted it enforced. If the limit is advisory, operators will learn to ignore it (alarm fatigue). The hard block forces a conscious decision: raise the limit, wait for the window to reset, or switch to a cheaper model. The error message provides all the information needed to make that decision. This is the same philosophy as compaction thresholds — the operator configures the boundary, kaged enforces it.
Alternative D — Per-model spend limits
Spend limits scoped to individual models within a provider.
Why tempting: Fine-grained control. An operator might want to cap Claude Opus at $5/5hr but allow Claude Sonnet unlimited.
Why rejected: Adds combinatorial complexity to the enforcement path. A single session may use multiple models (compaction summarizer uses a different model than the primary agent). The daemon must check limits per-call, not per-session. Per-model limits mean the enforcement query must match on (provider, model_id) and the UI must manage N×M limit configurations. Starting per-provider keeps the schema and enforcement simple. A future amendment can add per-model granularity if needed.
Alternative E — Scheduled usage polling
Fetch provider usage on a fixed interval (every N minutes) instead of on-demand.
Why tempting: The UI always has fresh data. No explicit "refresh" button needed.
Why rejected: Most provider usage endpoints have their own rate limits. Polling every 5 minutes for a provider the operator hasn't used in hours wastes quota. On-demand + cache-with-invalidation-on-LLM-call means the data is always fresh when it matters (the operator just made a call) and not fetched when it doesn't (idle time). Manual refresh covers the rare "something else used my provider" case.
Alternative F — Override only pricing, not capabilities
Allow operators to override pricing but not capability flags.
Why tempting: Capabilities are binary and usually correct in LiteLLM. Pricing changes more often.
Why rejected: Self-hosted models (Ollama, vLLM) have no LiteLLM entry at all — both pricing and capabilities are missing. The compaction system depends on maxInputTokens (a capability-adjacent field). The override system must be general enough to cover the self-hosted case, which means supporting all fields. Restricting it to pricing would require a parallel mechanism for capabilities — more complexity, not less.
References
- ADR-0014: All LLM providers route through @kaged/llm
- ADR-0024: Context compaction is kaged-owned, layered, observable, and operator-tunable
- ADR-0005: Storage default is SQLite
- ADR-0003: Doc-first, then TDD
- Spec: LLM Provider Interface (model metadata catalog, usage fetchers)
- Spec: HTTP API (daemon endpoints)
- Spec: Storage, inline in @kaged/storage (existing tables)