Session-Based Context Architecture

Enterprise Assistant — Architecture Plan & Performance Analysis

Branch: feature/session-based-context · March 12, 2026

The Problem: Stateless Tax

Every API request the enterprise assistant receives today pays a full context reconstruction cost — regardless of whether it's the first message or the tenth in a conversation. We measured this directly against production:

- 12,058 input tokens per request (even for "What is 2+2?")
- 7–11s warm latency for a simple query (no tools)
- 34s cold-start latency (container scaling from zero)

Compare to OpenClaw (the reference implementation this assistant is modeled on):

- 3–5K input tokens per request (session-based, incremental)
- 2–4s latency for a simple query (warm session)
- ~0s per-request context cost (after session init)
Root cause: WorkspaceReader fetches AGENT.md, USER.md, and all SKILL.md files from OneDrive on every single API request. This rebuilds 12K tokens of system prompt that doesn't change between requests. The per-request architecture is fighting against how LLMs work.

Why Our Architecture Makes This Easy to Fix

Most multi-user applications would face complex session management challenges here. We don't — because of a key architectural decision already made: one Container App per user.

The container IS the session. A user's Container App is dedicated exclusively to them. There's no multi-tenancy to manage. We can load their workspace context once at container startup and hold it in memory forever.

This is exactly how OpenClaw works: context is loaded once when the process starts, held in memory, and every conversation just appends incremental turns.


Two Approaches to Compare

Approach A: Current (Stateless), main branch
- Cost: ~12K tokens × N messages per conversation
- Latency: 7–11s warm, 34s cold

Approach B: Session-Based, feature/session-based-context branch
- Cost: ~2K base + incremental turns
- Latency target: 2–4s warm
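The gap compounds over a conversation. A back-of-envelope sketch using the figures above — the ~500-token per-turn increment is an assumption, and the accounting follows this document's "base + incremental" framing rather than a full prompt-caching model:

```python
# Back-of-envelope token cost over a conversation, using the measured
# numbers above. TOKENS_PER_TURN is an assumed average, not a measurement.
STATELESS_CONTEXT = 12_000   # full system prompt rebuilt on every request
SESSION_BASE = 2_000         # loaded once at container startup
TOKENS_PER_TURN = 500        # assumed average size of one appended turn

def stateless_cost(turns: int) -> int:
    # Approach A: every request pays the full reconstruction cost
    return (STATELESS_CONTEXT + TOKENS_PER_TURN) * turns

def session_cost(turns: int) -> int:
    # Approach B: base paid once, then only incremental turns
    return SESSION_BASE + TOKENS_PER_TURN * turns

for n in (1, 10):
    print(n, stateless_cost(n), session_cost(n))
```

At ten turns the stateless design has spent over 100K input tokens on context alone; the session-based design stays under 10K.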


Implementation Plan — Branch: feature/session-based-context

1. AppContext Singleton (new: packages/agent_core/app_context.py)

from datetime import datetime

class AppContext:
    """Loaded once at startup. Held in memory for the container's lifetime."""

    workspace_content: str        # AGENT.md + USER.md merged
    skills_index: list[dict]      # [{name, description, content}]
    base_system_prompt: str       # rendered from workspace_content
    loaded_at: datetime
    user_email: str

    @classmethod
    async def initialize(cls, workspace_reader, user_email) -> "AppContext":
        # Called once from FastAPI lifespan startup event
        ...

    def refresh_if_stale(self, max_age_seconds=300):
        # Check if files changed, reload if needed
        ...
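A minimal sketch of what initialize could look like. The reader methods (load_agent_md, load_user_md, load_skills), render_system_prompt, and StubReader are all stand-ins for illustration — not the real WorkspaceReader API or prompt template:

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timezone

def render_system_prompt(workspace: str, skills: list[dict]) -> str:
    # Stand-in for the real prompt template.
    names = ", ".join(s["name"] for s in skills)
    return f"{workspace}\n\nAvailable skills: {names}"

@dataclass
class AppContext:
    workspace_content: str
    skills_index: list[dict]
    base_system_prompt: str
    loaded_at: datetime
    user_email: str

    @classmethod
    async def initialize(cls, workspace_reader, user_email: str) -> "AppContext":
        # One round of OneDrive reads, paid once per container lifetime.
        agent_md = await workspace_reader.load_agent_md()
        user_md = await workspace_reader.load_user_md()
        skills = await workspace_reader.load_skills()
        workspace = f"{agent_md}\n\n{user_md}"
        return cls(
            workspace_content=workspace,
            skills_index=skills,
            base_system_prompt=render_system_prompt(workspace, skills),
            loaded_at=datetime.now(timezone.utc),
            user_email=user_email,
        )

class StubReader:
    # In-memory stand-in for WorkspaceReader, for illustration only.
    async def load_agent_md(self): return "# AGENT"
    async def load_user_md(self): return "# USER"
    async def load_skills(self): return [{"name": "email", "description": "", "content": ""}]

ctx = asyncio.run(AppContext.initialize(StubReader(), "user@example.com"))
print(ctx.base_system_prompt.splitlines()[0])  # → # AGENT
```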

2. FastAPI Lifespan Startup (change: apps/gateway/main.py)

from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # STARTUP: load workspace once
    app.state.ctx = await AppContext.initialize(workspace_reader, user_email)
    app.state.conversations = {}   # conversation_id → message list (in-memory)
    yield
    # SHUTDOWN: flush any pending Postgres writes

app = FastAPI(lifespan=lifespan)

3. In-Memory Conversation Store (change: packages/context/memory.py)

import asyncio

class ConversationMemory:
    def __init__(self, db: Database):
        self.db = db
        self._cache: dict[str, list] = {}   # conversation_id → messages

    def get_messages(self, conversation_id: str) -> list:
        if conversation_id in self._cache:
            return self._cache[conversation_id]          # ← fast path, no DB
        msgs = self.db.fetch_all("SELECT ...", (conversation_id,))  # cold load
        self._cache[conversation_id] = msgs
        return msgs

    def add_message(self, conversation_id: str, msg: dict):
        self._cache.setdefault(conversation_id, []).append(msg)
        # Write-behind: persist without blocking the request path
        asyncio.create_task(self._flush_to_postgres(conversation_id, msg))
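The fast path can be demonstrated with a counting stub in place of the real Database (FakeDB and the trimmed-down class below are illustrative; the write-behind flush is omitted):

```python
class FakeDB:
    """Stand-in for the real Database; counts queries to expose the fast path."""
    def __init__(self, rows):
        self.rows = rows
        self.fetch_count = 0

    def fetch_all(self, sql, params):
        self.fetch_count += 1
        return list(self.rows)

class ConversationMemory:
    def __init__(self, db):
        self.db = db
        self._cache: dict[str, list] = {}

    def get_messages(self, conversation_id: str) -> list:
        if conversation_id in self._cache:
            return self._cache[conversation_id]          # fast path, no DB
        msgs = self.db.fetch_all("SELECT ...", (conversation_id,))  # cold load
        self._cache[conversation_id] = msgs
        return msgs

db = FakeDB([{"role": "user", "content": "hi"}])
mem = ConversationMemory(db)
mem.get_messages("c1")   # cold: hits the DB
mem.get_messages("c1")   # warm: served from cache
print(db.fetch_count)    # → 1
```

Only the first call for a conversation touches Postgres; every later turn in that conversation reads straight from memory.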

4. Agent Loop (change: packages/agent_core/loop.py)

# Before: workspace loaded per request
system = build_system_prompt(workspace.load_agent_md(), workspace.load_user_md(), ...)

# After: use pre-loaded app context
system = app_ctx.base_system_prompt   # already rendered, no I/O
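Putting the pieces together, the warm request path needs no I/O at all: system prompt from the pre-loaded AppContext, history from the in-memory cache. build_request and the shapes below are illustrative, not the real loop.py signatures:

```python
from types import SimpleNamespace

def build_request(app_ctx, history: list[dict], user_text: str) -> dict:
    # No OneDrive reads, no DB round-trip on the warm path.
    messages = list(history)
    messages.append({"role": "user", "content": user_text})
    return {"system": app_ctx.base_system_prompt, "messages": messages}

# Illustrative shapes only:
app_ctx = SimpleNamespace(base_system_prompt="<pre-rendered system prompt>")
history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
req = build_request(app_ctx, history, "What is 2+2?")
print(len(req["messages"]))  # → 3
```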

5. Min-Replicas Fix (infra change)

az containerapp update \
  --name ca-entasst-gateway \
  --resource-group rg-enterprise \
  --min-replicas 1

Costs ~$30/month (one always-warm container) but eliminates 34s cold starts entirely.


Testing & Eval Framework

To compare approaches objectively, we're building a unified eval system with four components:

| Component | What It Does | Location |
| --- | --- | --- |
| Task Suite | 30 golden tasks across 5 categories (simple Q&A, email, calendar, tool-chaining, edge cases), each with expected behavior and a quality rubric | tests/evals/tasks.json |
| Eval Runner | Fires each task against a target URL; measures latency, tokens, tool calls, and errors; sends output to an LLM judge for a quality score (1–5) | tests/evals/run_evals.py |
| Results Store | DuckDB database storing every eval run with full metrics; queryable for diffs between branches and historical trends | tests/evals/results.duckdb |
| Dashboard | Simple FastAPI + HTML dashboard showing current-vs-baseline comparison, latency charts, token usage, quality scores, and per-task drill-down; accessible at /evals on the gateway | apps/gateway/api/evals.py + apps/evals-ui/ |
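A sketch of the runner's per-task measurement loop. call_target stands in for the HTTP client hitting the target URL, and the task/response field names are assumptions about the eventual schema:

```python
import time

def run_task(task: dict, call_target) -> dict:
    """Fire one golden task and collect the metrics the results store expects."""
    start = time.perf_counter()
    try:
        result = call_target(task["prompt"])
        error = None
    except Exception as exc:              # timeouts, 5xx, tool errors, etc.
        result, error = {}, str(exc)
    return {
        "task": task["name"],
        "latency_s": time.perf_counter() - start,
        "input_tokens": result.get("input_tokens"),
        "tool_calls": result.get("tool_calls", []),
        "error": error,
    }

# Illustrative run against a stub target:
row = run_task(
    {"name": "simple-qa-01", "prompt": "What is 2+2?"},
    lambda prompt: {"input_tokens": 2400, "tool_calls": []},
)
print(row["task"], row["error"])  # → simple-qa-01 None
```

Each returned row would be appended to the DuckDB results store, tagged with the branch under test.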

Eval Metrics Per Run

| Metric | How Measured | Target (Session-Based) |
| --- | --- | --- |
| Total latency | Wall clock from request to done event | < 4s simple, < 8s tool |
| Time to first token | Time from request to first text_delta SSE event | < 1.5s |
| Input tokens | Returned in done event | < 3K simple, < 5K tool |
| Tool call accuracy | Did the agent call the expected tool(s)? | > 90% |
| Error rate | 5xx / tool_error events / timeouts | < 2% |
| Quality score | LLM judge (Claude Haiku) scores 1–5 against rubric | ≥ 4.0 average |
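Time to first token falls straight out of the SSE stream: stop the clock at the first text_delta event. The event shape below is an assumption about the gateway's stream format:

```python
import time

def time_to_first_token(events) -> "float | None":
    """Consume SSE events until the first text_delta; return elapsed seconds."""
    start = time.perf_counter()
    for event in events:
        if event.get("type") == "text_delta":
            return time.perf_counter() - start
    return None  # stream ended without any text

# Illustrative stream (in the runner, events would arrive over the wire):
stream = iter([{"type": "message_start"}, {"type": "text_delta", "text": "4"}])
ttft = time_to_first_token(stream)
print(ttft is not None)  # → True
```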

Dashboard Preview

The eval dashboard will live at /evals on the gateway and show the current-vs-baseline comparison, latency charts, token usage, quality scores, and per-task drill-down.


Decision Criteria

After running evals against both branches, the session-based approach ships if:

| Criterion | Pass Threshold |
| --- | --- |
| Avg latency improvement | ≥ 40% reduction (7s → < 4.5s) |
| Token reduction | ≥ 50% (12K → < 6K per turn) |
| Quality score | No regression vs baseline (≥ 4.0) |
| Error rate | No increase vs baseline |
| Conversation continuity | History correctly maintained across turns |
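The table translates into a mechanical check over two eval summaries. Field names here are assumptions about the summary shape; conversation continuity stays a functional test rather than a metric threshold:

```python
def session_based_ships(baseline: dict, candidate: dict) -> dict:
    """Apply the pass thresholds above. Ship only if every check is True
    (plus the separate conversation-continuity functional test)."""
    return {
        "latency": candidate["avg_latency_s"] <= 0.6 * baseline["avg_latency_s"],       # ≥ 40% cut
        "tokens": candidate["avg_input_tokens"] <= 0.5 * baseline["avg_input_tokens"],  # ≥ 50% cut
        "quality": candidate["quality_score"] >= max(4.0, baseline["quality_score"]),   # no regression, ≥ 4.0
        "errors": candidate["error_rate"] <= baseline["error_rate"],                    # no increase
    }

# Illustrative numbers only, not eval results:
checks = session_based_ships(
    {"avg_latency_s": 9.0, "avg_input_tokens": 12_058, "quality_score": 4.1, "error_rate": 0.01},
    {"avg_latency_s": 3.5, "avg_input_tokens": 4_800, "quality_score": 4.2, "error_rate": 0.01},
)
print(all(checks.values()))  # → True
```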

Analysis based on live production testing against ca-entasst-gateway.calmhill-d9f0bfc3.eastus.azurecontainerapps.io · March 12, 2026

Branch: feature/session-based-context · Report prepared by Chief · Internal use only