Enterprise Assistant — Architecture Plan & Performance Analysis
Branch: feature/session-based-context · March 12, 2026
Every API request the enterprise assistant receives today pays the full context-reconstruction cost, regardless of whether it's the first message or the tenth in a conversation. We measured this directly against production; the numbers are summarized below.
WorkspaceReader fetches AGENT.md, USER.md, and all SKILL.md files from OneDrive on every single API request, rebuilding ~12K tokens of system prompt that don't change between requests. Compare OpenClaw (the reference implementation this assistant is modeled on), which pays that cost once per process: the per-request architecture is fighting against how LLMs work.
Most multi-user applications would face complex session management challenges here. We don't — because of a key architectural decision already made: one Container App per user.
The container IS the session. A user's Container App is dedicated exclusively to them. There's no multi-tenancy to manage. We can load their workspace context once at container startup and hold it in memory forever.
This is exactly how OpenClaw works: context is loaded once when the process starts, held in memory, and every conversation just appends incremental turns.
| Branch | Cost | Latency |
|---|---|---|
| main | ~12K tokens × N messages per conversation | 7–11s warm, 34s cold |
| feature/session-based-context | ~2K base + incremental turns | 2–4s warm (target) |
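The gap widens with conversation length. A back-of-envelope comparison, using the ~12K and ~2K figures above; the 500-token incremental-turn size is an illustrative assumption, not a measured number:

```python
# Cumulative input-token cost over an N-message conversation.
# main: the full ~12K-token context is rebuilt and resent on every request.
# feature: a ~2K base plus one incremental turn per message.
# The 500-token per-turn figure is an assumption for illustration.

def main_branch_cost(n_messages: int, context_tokens: int = 12_000) -> int:
    return context_tokens * n_messages

def session_based_cost(n_messages: int, base_tokens: int = 2_000,
                       turn_tokens: int = 500) -> int:
    return base_tokens + turn_tokens * n_messages

for n in (1, 5, 10):
    print(n, main_branch_cost(n), session_based_cost(n))
```

At ten messages that is 120K cumulative input tokens on main versus roughly 7K on the feature branch.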
`packages/agent_core/app_context.py`
```python
from datetime import datetime

class AppContext:
    """Loaded once at startup. Held in memory for the container's lifetime."""
    workspace_content: str       # AGENT.md + USER.md merged
    skills_index: list[dict]     # [{name, description, content}]
    base_system_prompt: str      # rendered from workspace_content
    loaded_at: datetime
    user_email: str

    @classmethod
    async def initialize(cls, workspace_reader, user_email) -> "AppContext":
        # Called once from the FastAPI lifespan startup event
        ...

    def refresh_if_stale(self, max_age_seconds=300):
        # Check whether the workspace files changed; reload if needed
        ...
```
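A minimal, runnable sketch of what `initialize` might do. It assumes the `load_agent_md()`/`load_user_md()` reader methods shown in the loop diff further down; `load_skill_files()` and `render_system_prompt()` are hypothetical names introduced here for illustration:

```python
# Sketch only: load_skill_files() and render_system_prompt() are
# hypothetical helpers, not confirmed APIs.
import asyncio
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AppContext:
    workspace_content: str
    skills_index: list[dict]
    base_system_prompt: str
    loaded_at: datetime
    user_email: str

    @classmethod
    async def initialize(cls, workspace_reader, user_email: str) -> "AppContext":
        agent_md = await workspace_reader.load_agent_md()
        user_md = await workspace_reader.load_user_md()
        skills = await workspace_reader.load_skill_files()  # hypothetical
        workspace_content = f"{agent_md}\n\n{user_md}"
        return cls(
            workspace_content=workspace_content,
            skills_index=skills,
            base_system_prompt=render_system_prompt(workspace_content, skills),
            loaded_at=datetime.now(timezone.utc),
            user_email=user_email,
        )

def render_system_prompt(workspace_content: str, skills: list[dict]) -> str:
    # Hypothetical renderer: workspace docs first, then a skills index.
    lines = [workspace_content, "", "Available skills:"]
    lines += [f"- {s['name']}: {s['description']}" for s in skills]
    return "\n".join(lines)

# Demo with a stub reader standing in for the OneDrive-backed WorkspaceReader.
class StubReader:
    async def load_agent_md(self): return "AGENT"
    async def load_user_md(self): return "USER"
    async def load_skill_files(self):
        return [{"name": "email", "description": "send mail", "content": ""}]

ctx = asyncio.run(AppContext.initialize(StubReader(), "user@example.com"))
```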
`apps/gateway/main.py`
```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # STARTUP: load the workspace once
    app.state.ctx = await AppContext.initialize(workspace_reader, user_email)
    app.state.conversations = {}  # conversation_id → message list (in-memory)
    yield
    # SHUTDOWN: flush any pending Postgres writes

app = FastAPI(lifespan=lifespan)
```
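With the context preloaded, the per-request hot path reduces to a dictionary lookup and an append; there is no file I/O and no prompt rebuild. A sketch of that path (`build_messages` is a hypothetical helper, not an existing function; the handler wiring is omitted):

```python
# Hot path for one chat turn. app_ctx.base_system_prompt was rendered
# once at startup; conversations is the in-memory dict from app.state.

def build_messages(app_ctx, conversations: dict, conversation_id: str,
                   user_text: str) -> list[dict]:
    history = conversations.setdefault(conversation_id, [])
    history.append({"role": "user", "content": user_text})
    # System prompt is a pre-rendered string: O(1) lookup, no OneDrive fetch.
    return [{"role": "system", "content": app_ctx.base_system_prompt}, *history]
```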
`packages/context/memory.py`
```python
import asyncio

class ConversationMemory:
    def __init__(self, db: Database):
        self.db = db
        self._cache: dict[str, list] = {}  # conversation_id → messages

    def get_messages(self, conversation_id: str) -> list:
        if conversation_id in self._cache:
            return self._cache[conversation_id]  # ← fast path, no DB
        msgs = self.db.fetch_all("SELECT ...", (conversation_id,))  # cold load
        self._cache[conversation_id] = msgs
        return msgs

    def add_message(self, conversation_id: str, ...):
        self._cache.setdefault(conversation_id, []).append(msg)
        asyncio.create_task(self._flush_to_postgres(conversation_id, msg))  # async
```
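The write-behind in `add_message` keeps the request path off the database entirely. A simplified, self-contained sketch of that pattern; the list standing in for Postgres and the `_flush` body are illustrative stand-ins:

```python
import asyncio

class WriteBehindMemory:
    """Simplified stand-in for ConversationMemory's write path."""

    def __init__(self):
        self._cache: dict[str, list] = {}
        self.flushed: list = []  # stand-in for rows written to Postgres

    def add_message(self, conversation_id: str, msg: dict) -> None:
        # Append to the in-memory cache immediately (request path)...
        self._cache.setdefault(conversation_id, []).append(msg)
        # ...and persist in the background without blocking the caller.
        asyncio.create_task(self._flush(conversation_id, msg))

    async def _flush(self, conversation_id: str, msg: dict) -> None:
        await asyncio.sleep(0)  # simulates an async DB write
        self.flushed.append((conversation_id, msg))

async def main() -> "WriteBehindMemory":
    mem = WriteBehindMemory()
    mem.add_message("c1", {"role": "user", "content": "hi"})
    assert mem._cache["c1"]    # visible to readers immediately
    await asyncio.sleep(0.01)  # give the background flush a chance to run
    return mem

mem = asyncio.run(main())
```

Note that `create_task` requires a running event loop, which is why the demo drives it from an async `main`; in the gateway the FastAPI event loop plays that role.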
`packages/agent_core/loop.py`

```python
# Before: workspace loaded per request
system = build_system_prompt(workspace.load_agent_md(), workspace.load_user_md(), ...)

# After: use the pre-loaded app context
system = app_ctx.base_system_prompt  # already rendered, no I/O
```
```shell
az containerapp update \
  --name ca-entasst-gateway \
  --resource-group rg-enterprise \
  --min-replicas 1
```
Costs ~$30/month (one always-warm container) but eliminates 34s cold starts entirely.
To compare approaches objectively, we're building a unified eval system with four components:
| Component | What It Does | Location |
|---|---|---|
| Task Suite | 30 golden tasks across 5 categories: simple Q&A, email, calendar, tool-chaining, edge cases. Each has expected behavior and a quality rubric. | tests/evals/tasks.json |
| Eval Runner | Fires each task against a target URL, measures latency / tokens / tool calls / errors, sends output to LLM judge for quality score (1–5) | tests/evals/run_evals.py |
| Results Store | DuckDB database storing every eval run with full metrics. Queryable, diffs between branches, historical trend | tests/evals/results.duckdb |
| Dashboard | Simple FastAPI + HTML dashboard. Shows current vs baseline comparison, latency charts, token usage, quality scores, per-task drill-down. Accessible at /evals on the gateway | apps/gateway/api/evals.py + apps/evals-ui/ |
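Each golden task in tests/evals/tasks.json could take a shape like the following; the exact schema is not yet fixed, and the field and tool names here are illustrative assumptions:

```json
{
  "id": "calendar-003",
  "category": "calendar",
  "prompt": "Move my 2pm meeting with Dana to Thursday",
  "expected_tools": ["calendar.find_event", "calendar.update_event"],
  "rubric": "Confirms the correct event was found and rescheduled without double-booking."
}
```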
| Metric | How Measured | Target (Session-Based) |
|---|---|---|
| Total latency | Wall clock from request to done event | < 4s simple, < 8s tool |
| Time to first token | Time from request to first text_delta SSE event | < 1.5s |
| Input tokens | Returned in done event | < 3K simple, < 5K tool |
| Tool call accuracy | Did the agent call the expected tool(s)? | > 90% |
| Error rate | 5xx / tool_error events / timeouts | < 2% |
| Quality score | LLM judge (Claude Haiku) scores 1–5 against rubric | ≥ 4.0 average |
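The latency metrics fall out of timestamps on the SSE events. A minimal sketch of per-task extraction, assuming the runner records (event_name, seconds-since-request) pairs; the recording format is an assumption, while the `text_delta` and `done` event names come from the table above:

```python
# Derive total latency and time-to-first-token from a recorded event
# stream. Events are (name, seconds-since-request-start) pairs as the
# eval runner might capture them.

def extract_latency_metrics(events: list[tuple[str, float]]) -> dict:
    first_token = next(t for name, t in events if name == "text_delta")
    done = next(t for name, t in events if name == "done")
    return {"time_to_first_token_s": first_token, "total_latency_s": done}

metrics = extract_latency_metrics(
    [("text_delta", 1.2), ("text_delta", 1.9), ("done", 3.4)]
)
print(metrics)  # → {'time_to_first_token_s': 1.2, 'total_latency_s': 3.4}
```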
The eval dashboard will live at /evals on the gateway, showing current-vs-baseline comparisons, latency charts, token usage, quality scores, and per-task drill-downs.
After running evals against both branches, the session-based approach ships if:
| Criterion | Pass Threshold |
|---|---|
| Avg latency improvement | ≥ 40% reduction (7s → ≤ 4.2s) |
| Token reduction | ≥ 50% (12K → < 6K per turn) |
| Quality score | No regression vs baseline (≥ 4.0) |
| Error rate | No increase vs baseline |
| Conversation continuity | History correctly maintained across turns |
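The ship decision can be checked mechanically against eval output. A sketch encoding the thresholds above; the metric dictionary keys are illustrative, not the actual results-store schema:

```python
# Encode the ship criteria. Latency and token thresholds are relative to
# the main-branch baseline; quality must hold at ≥ 4.0 and the error rate
# must not increase.

def should_ship(baseline: dict, candidate: dict) -> bool:
    latency_cut = 1 - candidate["avg_latency_s"] / baseline["avg_latency_s"]
    token_cut = 1 - candidate["tokens_per_turn"] / baseline["tokens_per_turn"]
    return (
        latency_cut >= 0.40                                    # ≥ 40% latency reduction
        and token_cut >= 0.50                                  # ≥ 50% token reduction
        and candidate["quality"] >= 4.0                        # no quality regression
        and candidate["error_rate"] <= baseline["error_rate"]  # no error-rate increase
        and candidate["continuity_ok"]                         # history kept across turns
    )

baseline = {"avg_latency_s": 7.0, "tokens_per_turn": 12_000, "error_rate": 0.01}
candidate = {"avg_latency_s": 3.5, "tokens_per_turn": 4_000, "quality": 4.2,
             "error_rate": 0.01, "continuity_ok": True}
print(should_ship(baseline, candidate))  # → True
```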
Analysis based on live production testing against ca-entasst-gateway.calmhill-d9f0bfc3.eastus.azurecontainerapps.io · March 12, 2026
Branch: feature/session-based-context · Report prepared by Chief · Internal use only