Enterprise Assistant — Architecture Plan & Performance Analysis
Branch: feature/session-based-context · March 12, 2026
Every API request the enterprise assistant receives today pays the full context-reconstruction cost, regardless of whether it's the first message or the tenth in a conversation. We measured this directly against production; the numbers are summarized below.
WorkspaceReader fetches AGENT.md, USER.md, and all SKILL.md files from OneDrive on every single API request, rebuilding ~12K tokens of system prompt that don't change between requests. Compare OpenClaw (the reference implementation this assistant is modeled on), which pays that cost once per process: the per-request architecture is fighting against how LLMs work.
Most multi-user applications would face complex session management challenges here. We don't — because of a key architectural decision already made: one Container App per user.
The container IS the session. A user's Container App is dedicated exclusively to them. There's no multi-tenancy to manage. We can load their workspace context once at container startup and hold it in memory forever.
This is exactly how OpenClaw works: context is loaded once when the process starts, held in memory, and every conversation just appends incremental turns.
| Branch | Cost | Latency |
|---|---|---|
| main | ~12K tokens × N messages per conversation | 7–11s warm, 34s cold |
| feature/session-based-context | ~2K base + incremental turns | 2–4s warm (target) |
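The gap widens with conversation length. A back-of-envelope comparison, using the ~12K and ~2K figures above; the 500-token incremental-turn size is an illustrative assumption, not a measured number:

```python
# Cumulative input-token cost over an N-message conversation.
# main: the full ~12K-token context is rebuilt and resent on every request.
# feature: a ~2K base plus one incremental turn per message.
# The 500-token per-turn figure is an assumption for illustration.

def main_branch_cost(n_messages: int, context_tokens: int = 12_000) -> int:
    return context_tokens * n_messages

def session_based_cost(n_messages: int, base_tokens: int = 2_000,
                       turn_tokens: int = 500) -> int:
    return base_tokens + turn_tokens * n_messages

for n in (1, 5, 10):
    print(n, main_branch_cost(n), session_based_cost(n))
```

At ten messages that is 120K cumulative input tokens on main versus roughly 7K on the feature branch.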
`packages/agent_core/app_context.py`
```python
from datetime import datetime

class AppContext:
    """Loaded once at startup. Held in memory for the container's lifetime."""
    workspace_content: str       # AGENT.md + USER.md merged
    skills_index: list[dict]     # [{name, description, content}]
    base_system_prompt: str      # rendered from workspace_content
    loaded_at: datetime
    user_email: str

    @classmethod
    async def initialize(cls, workspace_reader, user_email) -> "AppContext":
        # Called once from the FastAPI lifespan startup event
        ...

    def refresh_if_stale(self, max_age_seconds=300):
        # Check whether the workspace files changed; reload if needed
        ...
```
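A minimal, runnable sketch of what `initialize` might do. It assumes the `load_agent_md()`/`load_user_md()` reader methods shown in the loop diff further down; `load_skill_files()` and `render_system_prompt()` are hypothetical names introduced here for illustration:

```python
# Sketch only: load_skill_files() and render_system_prompt() are
# hypothetical helpers, not confirmed APIs.
import asyncio
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AppContext:
    workspace_content: str
    skills_index: list[dict]
    base_system_prompt: str
    loaded_at: datetime
    user_email: str

    @classmethod
    async def initialize(cls, workspace_reader, user_email: str) -> "AppContext":
        agent_md = await workspace_reader.load_agent_md()
        user_md = await workspace_reader.load_user_md()
        skills = await workspace_reader.load_skill_files()  # hypothetical
        workspace_content = f"{agent_md}\n\n{user_md}"
        return cls(
            workspace_content=workspace_content,
            skills_index=skills,
            base_system_prompt=render_system_prompt(workspace_content, skills),
            loaded_at=datetime.now(timezone.utc),
            user_email=user_email,
        )

def render_system_prompt(workspace_content: str, skills: list[dict]) -> str:
    # Hypothetical renderer: workspace docs first, then a skills index.
    lines = [workspace_content, "", "Available skills:"]
    lines += [f"- {s['name']}: {s['description']}" for s in skills]
    return "\n".join(lines)

# Demo with a stub reader standing in for the OneDrive-backed WorkspaceReader.
class StubReader:
    async def load_agent_md(self): return "AGENT"
    async def load_user_md(self): return "USER"
    async def load_skill_files(self):
        return [{"name": "email", "description": "send mail", "content": ""}]

ctx = asyncio.run(AppContext.initialize(StubReader(), "user@example.com"))
```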
`apps/gateway/main.py`
```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # STARTUP: load the workspace once
    app.state.ctx = await AppContext.initialize(workspace_reader, user_email)
    app.state.conversations = {}  # conversation_id → message list (in-memory)
    yield
    # SHUTDOWN: flush any pending Postgres writes

app = FastAPI(lifespan=lifespan)
```
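With the context preloaded, the per-request hot path reduces to a dictionary lookup and an append; there is no file I/O and no prompt rebuild. A sketch of that path (`build_messages` is a hypothetical helper, not an existing function; the handler wiring is omitted):

```python
# Hot path for one chat turn. app_ctx.base_system_prompt was rendered
# once at startup; conversations is the in-memory dict from app.state.

def build_messages(app_ctx, conversations: dict, conversation_id: str,
                   user_text: str) -> list[dict]:
    history = conversations.setdefault(conversation_id, [])
    history.append({"role": "user", "content": user_text})
    # System prompt is a pre-rendered string: O(1) lookup, no OneDrive fetch.
    return [{"role": "system", "content": app_ctx.base_system_prompt}, *history]
```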
`packages/context/memory.py`
```python
import asyncio

class ConversationMemory:
    def __init__(self, db: Database):
        self.db = db
        self._cache: dict[str, list] = {}  # conversation_id → messages

    def get_messages(self, conversation_id: str) -> list:
        if conversation_id in self._cache:
            return self._cache[conversation_id]  # ← fast path, no DB
        msgs = self.db.fetch_all("SELECT ...", (conversation_id,))  # cold load
        self._cache[conversation_id] = msgs
        return msgs

    def add_message(self, conversation_id: str, ...):
        self._cache.setdefault(conversation_id, []).append(msg)
        asyncio.create_task(self._flush_to_postgres(conversation_id, msg))  # async
```
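The write-behind in `add_message` keeps the request path off the database entirely. A simplified, self-contained sketch of that pattern; the list standing in for Postgres and the `_flush` body are illustrative stand-ins:

```python
import asyncio

class WriteBehindMemory:
    """Simplified stand-in for ConversationMemory's write path."""

    def __init__(self):
        self._cache: dict[str, list] = {}
        self.flushed: list = []  # stand-in for rows written to Postgres

    def add_message(self, conversation_id: str, msg: dict) -> None:
        # Append to the in-memory cache immediately (request path)...
        self._cache.setdefault(conversation_id, []).append(msg)
        # ...and persist in the background without blocking the caller.
        asyncio.create_task(self._flush(conversation_id, msg))

    async def _flush(self, conversation_id: str, msg: dict) -> None:
        await asyncio.sleep(0)  # simulates an async DB write
        self.flushed.append((conversation_id, msg))

async def main() -> "WriteBehindMemory":
    mem = WriteBehindMemory()
    mem.add_message("c1", {"role": "user", "content": "hi"})
    assert mem._cache["c1"]    # visible to readers immediately
    await asyncio.sleep(0.01)  # give the background flush a chance to run
    return mem

mem = asyncio.run(main())
```

Note that `create_task` requires a running event loop, which is why the demo drives it from an async `main`; in the gateway the FastAPI event loop plays that role.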
`packages/agent_core/loop.py`

```python
# Before: workspace loaded per request
system = build_system_prompt(workspace.load_agent_md(), workspace.load_user_md(), ...)

# After: use the pre-loaded app context
system = app_ctx.base_system_prompt  # already rendered, no I/O
```
```shell
az containerapp update \
  --name ca-entasst-gateway \
  --resource-group rg-enterprise \
  --min-replicas 1
```
Costs ~$30/month (one always-warm container) but eliminates 34s cold starts entirely.
To compare approaches objectively, we're building a unified eval system with four components:
| Component | What It Does | Location |
|---|---|---|
| Task Suite | 30 golden tasks across 5 categories: simple Q&A, email, calendar, tool-chaining, edge cases. Each has expected behavior and a quality rubric. | tests/evals/tasks.json |
| Eval Runner | Fires each task against a target URL, measures latency / tokens / tool calls / errors, sends output to LLM judge for quality score (1–5) | tests/evals/run_evals.py |
| Results Store | DuckDB database storing every eval run with full metrics. Queryable, diffs between branches, historical trend | tests/evals/results.duckdb |
| Dashboard | Simple FastAPI + HTML dashboard. Shows current vs baseline comparison, latency charts, token usage, quality scores, per-task drill-down. Accessible at /evals on the gateway | apps/gateway/api/evals.py + apps/evals-ui/ |
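Each golden task in tests/evals/tasks.json could take a shape like the following; the exact schema is not yet fixed, and the field and tool names here are illustrative assumptions:

```json
{
  "id": "calendar-003",
  "category": "calendar",
  "prompt": "Move my 2pm meeting with Dana to Thursday",
  "expected_tools": ["calendar.find_event", "calendar.update_event"],
  "rubric": "Confirms the correct event was found and rescheduled without double-booking."
}
```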
| Metric | How Measured | Target (Session-Based) |
|---|---|---|
| Total latency | Wall clock from request to done event | < 4s simple, < 8s tool |
| Time to first token | Time from request to first text_delta SSE event | < 1.5s |
| Input tokens | Returned in done event | < 3K simple, < 5K tool |
| Tool call accuracy | Did the agent call the expected tool(s)? | > 90% |
| Error rate | 5xx / tool_error events / timeouts | < 2% |
| Quality score | LLM judge (Claude Haiku) scores 1–5 against rubric | ≥ 4.0 average |
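The latency metrics fall out of timestamps on the SSE events. A minimal sketch of per-task extraction, assuming the runner records (event_name, seconds-since-request) pairs; the recording format is an assumption, while the `text_delta` and `done` event names come from the table above:

```python
# Derive total latency and time-to-first-token from a recorded event
# stream. Events are (name, seconds-since-request-start) pairs as the
# eval runner might capture them.

def extract_latency_metrics(events: list[tuple[str, float]]) -> dict:
    first_token = next(t for name, t in events if name == "text_delta")
    done = next(t for name, t in events if name == "done")
    return {"time_to_first_token_s": first_token, "total_latency_s": done}

metrics = extract_latency_metrics(
    [("text_delta", 1.2), ("text_delta", 1.9), ("done", 3.4)]
)
print(metrics)  # → {'time_to_first_token_s': 1.2, 'total_latency_s': 3.4}
```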
The eval dashboard will live at /evals on the gateway, showing current-vs-baseline comparisons, latency charts, token usage, quality scores, and per-task drill-downs.
After running evals against both branches, the session-based approach ships if:
| Criterion | Pass Threshold |
|---|---|
| Avg latency improvement | ≥ 40% reduction (7s → ≤ 4.2s) |
| Token reduction | ≥ 50% (12K → < 6K per turn) |
| Quality score | No regression vs baseline (≥ 4.0) |
| Error rate | No increase vs baseline |
| Conversation continuity | History correctly maintained across turns |
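The ship decision can be checked mechanically against eval output. A sketch encoding the thresholds above; the metric dictionary keys are illustrative, not the actual results-store schema:

```python
# Encode the ship criteria. Latency and token thresholds are relative to
# the main-branch baseline; quality must hold at ≥ 4.0 and the error rate
# must not increase.

def should_ship(baseline: dict, candidate: dict) -> bool:
    latency_cut = 1 - candidate["avg_latency_s"] / baseline["avg_latency_s"]
    token_cut = 1 - candidate["tokens_per_turn"] / baseline["tokens_per_turn"]
    return (
        latency_cut >= 0.40                                    # ≥ 40% latency reduction
        and token_cut >= 0.50                                  # ≥ 50% token reduction
        and candidate["quality"] >= 4.0                        # no quality regression
        and candidate["error_rate"] <= baseline["error_rate"]  # no error-rate increase
        and candidate["continuity_ok"]                         # history kept across turns
    )

baseline = {"avg_latency_s": 7.0, "tokens_per_turn": 12_000, "error_rate": 0.01}
candidate = {"avg_latency_s": 3.5, "tokens_per_turn": 4_000, "quality": 4.2,
             "error_rate": 0.01, "continuity_ok": True}
print(should_ship(baseline, candidate))  # → True
```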
Analysis based on live production testing against ca-entasst-gateway.calmhill-d9f0bfc3.eastus.azurecontainerapps.io · March 12, 2026
Branch: feature/session-based-context · Report prepared by Chief · Internal use only