Purpose
Implement a multi-tier memory system that enables a production AI agent to build, maintain, and recall a persistent understanding of its user across all interaction surfaces — interactive chat, autonomous background processing, and inbox analysis.
The architecture must support:
- Continuous learning from all interaction channels without explicit user instruction
- LLM-mediated memory consolidation that synthesizes raw observations into structured knowledge
- Semantic deep recall of facts, preferences, and context at query time
- Schema-free user modeling where the agent organically discovers and evolves its own ontology per user
The system is inspired by human memory consolidation: observations are captured in real-time, buffered as events, and periodically synthesized by a background process — analogous to how short-term memories consolidate into long-term storage during sleep.
1. Core Concepts
1.1 Memory Tiers
Define three tiers M = {W, S, V}, each serving a distinct cognitive purpose:
| Tier | Name | Storage | Purpose | Analogy |
|---|---|---|---|---|
| W | Working Memory | Text column | Continuously LLM-synthesized narrative of current context, projects, and priorities. Capped at ~500 words. | Prefrontal cortex — "what am I thinking about right now" |
| S | Information Scaffold | JSON column | Dynamic, schema-free structured profile. Keys are created organically (e.g., Role, Goals, Projects, Preferences). The agent invents its own ontology per user. | Declarative memory — "what I know about this person" |
| V | Vector Long-Term Memory | ChromaDB collection | Semantic embedding store with rich metadata filtering for deep recall of facts, attachment content, and tool outputs. | Episodic memory — "something I learned once and can retrieve if relevant" |
1.2 Event Bus Model
All writes to memory — from both interactive and autonomous paths — flow through a unified event bus:
Observation → ContextEvent (DB row, processed=false) → MemoryProcessor → {W, S, V}
This ensures:
- Consistent memory quality regardless of source
- Decoupled write and synthesis paths
- Audit trail of every observation
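The event-bus contract above can be sketched in a few lines. The `ContextEvent` fields mirror §4.2; the in-memory `EVENT_LOG` list and the `record` helper are illustrative stand-ins for the database layer, not the actual implementation:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative stand-in for the context_events table (see the data model).
EVENT_LOG: list["ContextEvent"] = []

@dataclass
class ContextEvent:
    user_id: str
    event_type: str            # e.g. "fact", "preference", "scaffold_update"
    content: str               # raw observation text
    source_metadata: dict = field(default_factory=dict)
    processed: bool = False    # MemoryProcessor flips this after synthesis
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def record(user_id: str, event_type: str, content: str, **meta) -> ContextEvent:
    """Both write paths reduce to this shape: append-only, processed=False."""
    event = ContextEvent(user_id, event_type, content, source_metadata=meta)
    EVENT_LOG.append(event)
    return event
```

Because every writer goes through the same append, the consolidation path never needs to know whether an observation came from chat or from an autonomous cycle.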
1.3 Write Paths
Two distinct write paths feed the same event bus:
- Interactive Path: The `context_tool.record_observation` tool, invoked by the LLM during chat when it detects new user information (facts, preferences, project details, corrections).
- Autonomous Path: The `process_autonomous_cycle` background worker, which creates `ContextEvent` records of types `scaffold_update` and `learned_fact` when processing inbox items.
1.4 Read Paths
Memory is consumed in two ways:
- Injection: At the start of every chat turn, Working Memory and Information Scaffold are injected directly into the system prompt as context. This ensures every conversation is grounded in accumulated knowledge.
- Retrieval: The `ContextService` performs semantic search against the Vector store at query time, returning the top-N most relevant long-term facts. These are appended to the system prompt as "Relevant Retrieved Facts."
2. Requirements
2.1 Functional Requirements
- Multi-channel observation capture: Record observations from interactive chat (`context_tool.record_observation`), autonomous cycles (`ContextEvent` with `scaffold_update`/`learned_fact`), and tool outputs (e.g., email attachment content indexed into the vector store)
- Asynchronous LLM-mediated consolidation: A background `MemoryProcessor` synthesizes pending observations into all three memory tiers using structured JSON output from the LLM
- Semantic retrieval with metadata filtering: Query-time retrieval must filter by `user_id`, `connection_id`, and `context_type`, returning only facts relevant to the active user, their active connections, and the query semantics
- Schema-free ontology discovery: The Information Scaffold must support arbitrary, evolving key hierarchies without a predefined schema. The LLM decides what categories to create
- Connection-scoped learned information: Each tool connection maintains its own `persistent_learned_info` JSON column, enabling per-tool preference storage (e.g., "default Jira project for this connection is PROJ")
- Graceful degradation: The system must function when any tier is empty (new-user cold start) or unavailable (ChromaDB down)
2.2 Non-Functional Requirements
- Latency: Memory injection ≤ 50 ms (JSON serialization of `working_memory` + `information_scaffold`). Vector retrieval ≤ 300 ms per query (amortized; cached embeddings are acceptable)
- Cost control: All embedding operations are deducted from the user's credit balance via the `system_embeddings` plugin configuration. The `_deduct_cost` method tracks per-operation cost against the user balance
- Idempotency: `ContextEvent` records are marked `processed=true` after successful synthesis. Failed synthesis leaves events unprocessed for retry
- Auditability: Every observation includes `source_metadata` (source, session_id, connection_id). Every synthesis run produces `AgentContextLog` entries
3. System Architecture
A) Observation Capture Layer
Two entrypoints feed the event bus:
Interactive (`context_tool.record_observation`):
User says something → LLM detects fact → calls record_observation(content, category)
→ ContextEvent(event_type=category, content=content, processed=false) saved to DB
Categories: fact, preference, project_details, correction
Autonomous (`process_autonomous_cycle`):
Inbox items arrive → Autonomous LLM analyzes → calls context_tool__update_user_context
→ ContextEvent(event_type="scaffold_update" | "learned_fact", content=...) saved to DB
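The interactive entrypoint can be sketched as follows. The category set matches the list above; `PENDING_EVENTS` is an illustrative stand-in for the DB insert, and the return shape is an assumption rather than the actual tool contract:

```python
# Stand-in for the context_events table.
PENDING_EVENTS: list[dict] = []

VALID_CATEGORIES = {"fact", "preference", "project_details", "correction"}

def record_observation(content: str, category: str, session_id: str) -> dict:
    """Validate the category chosen by the LLM, then enqueue an
    unprocessed ContextEvent for the MemoryProcessor to pick up later."""
    if category not in VALID_CATEGORIES:
        return {"ok": False, "error": f"unknown category: {category}"}
    PENDING_EVENTS.append({
        "event_type": category,
        "content": content,
        "source_metadata": {"source": "interactive", "session_id": session_id},
        "processed": False,
    })
    return {"ok": True}
```

Note that the tool returns immediately: synthesis cost is never paid on the chat critical path.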
B) Memory Processor (Consolidation Engine)
A background job (`memory_runner.py`) periodically:
- Fetches all `ContextEvent` where `processed = false`
- Groups by `user_id`
- For each user, invokes an LLM with:
  - Current Working Memory state
  - Current Information Scaffold state
  - New observations since last run
- Receives structured JSON output:

      {
        "updated_scaffold": { "Role": "Engineering Manager", "Projects": {...} },
        "updated_working_memory": "User is currently focused on...",
        "facts_to_embed": ["User prefers short emails", "Default Jira project is PROJ"]
      }

- Applies updates atomically:
  - `User.information_scaffold` ← merged structured profile (with recursive JSON normalization for LLM-produced stringified values)
  - `User.working_memory` ← rewritten narrative
  - `facts_to_embed[]` → each fact embedded via `text-embedding-3-small` and stored in ChromaDB with full metadata
- Marks all processed events as `processed = true`
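A minimal sketch of one consolidation pass, assuming `llm` returns the structured JSON shape described above. The function and parameter names (`run_consolidation`, `embed_and_store`) are illustrative, and the in-memory dicts stand in for the User rows:

```python
import json
from collections import defaultdict

def run_consolidation(events, users, llm, embed_and_store):
    """One MemoryProcessor pass over unprocessed events, grouped per user."""
    pending = [e for e in events if not e["processed"]]
    by_user = defaultdict(list)
    for e in pending:
        by_user[e["user_id"]].append(e)

    for user_id, user_events in by_user.items():
        user = users[user_id]
        prompt = {
            "current_scaffold": user["information_scaffold"],
            "current_working_memory": user["working_memory"],
            "new_observations": [f'[{e["event_type"]}] {e["content"]}'
                                 for e in user_events],
        }
        result = llm(json.dumps(prompt))  # expected keys per the schema above
        user["information_scaffold"] = result["updated_scaffold"]
        user["working_memory"] = result["updated_working_memory"]
        for fact in result.get("facts_to_embed", []):
            embed_and_store(user_id, fact)
        # Only mark events after a successful run; a failed run leaves
        # them unprocessed for retry (idempotency requirement, §2.2).
        for e in user_events:
            e["processed"] = True
```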
C) Context Injection Pipeline
At every chat turn (`_prepare_and_load_history`), the system assembles the agent's context:
1. Working Memory (narrative text) → System prompt "# Your Context" block
2. Information Scaffold (serialized JSON) → "Structured Data:" sub-block
3. Persistent Learned Info (serialized JSON) → "Learned Info:" sub-block
4. Vector Retrieval (semantic search) → "Relevant Retrieved Facts:" block
5. Agent Identity (name, constitution) → "# Agent Identity" block
6. TGL Directives → "# Temporal Governance Directives" block
7. Artifact Index → "Your Artifacts" block
Steps 1–3 are deterministic (fast, always available). Step 4 is semantic (requires ChromaDB + embedding API).
D) Vector Store (ContextService)
Backed by ChromaDB with a single collection (`toolstream_context`).
Metadata Schema:
context_type ∈ {user_profile, user_persistent, tool_schema, tool_entity,
tool_persistent, tool_attachment_content}
user_id (required, always filtered)
org_id (optional, for future multi-tenancy)
tool_id (optional, scopes facts to a specific tool)
connection_id (optional, scopes facts to a specific connection instance)
source ∈ {user_provided, system_learned, tool_indexed, tool_extracted_content}
message_id (optional, links to originating message)
attachment_id (optional, links to originating attachment)
filename (optional, for attachment content)
Retrieval Filter Logic:
WHERE user_id = current_user
AND (connection_id IN active_connections OR connection_id IS NULL)
This ensures user-level facts are always returned while tool-specific facts are scoped to active connections.
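The scoping rule above can be expressed in ChromaDB's `where` syntax. One caveat: Chroma cannot match on a *missing* metadata key, so this sketch assumes user-level facts store an empty-string sentinel for `connection_id` — that sentinel is an implementation choice, not something the spec defines:

```python
def build_where_filter(user_id: str, active_connection_ids: list[str]) -> dict:
    """Build the retrieval filter: user-level facts always pass, tool-level
    facts pass only for currently active connections."""
    return {
        "$and": [
            {"user_id": user_id},
            {"$or": [
                {"connection_id": {"$in": active_connection_ids}},
                # Sentinel standing in for "connection_id IS NULL".
                {"connection_id": ""},
            ]},
        ]
    }
```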
E) Connection-Scoped Memory
Beyond user-level memory, each Connection object maintains:
- `persistent_learned_info` (JSON): Tool-specific preferences (e.g., "default_project: PROJ", "preferred_issue_type: Task")
- `ConnectionContext` (table): Cached tool schema/metadata with TTL-based refresh

Both are updated via `context_tool.update_connection_context`, which allows the agent to store per-connection learned information during chat.
4. Data Model
4.1 User Memory Fields
-- On the User table:
working_memory TEXT -- AI-synthesized summary of user context (~500 words)
information_scaffold JSON -- Schema-free structured profile (dynamic keys)
persistent_learned_info JSON -- Adaptive, unstructured learned info
4.2 ContextEvent (Event Bus)
CREATE TABLE context_events (
id UUID PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users(id),
organization_id UUID NOT NULL REFERENCES organizations(id),
event_type VARCHAR(50) NOT NULL, -- scaffold_update, learned_fact, fact, preference, etc.
content TEXT NOT NULL, -- Raw observation text
source_metadata JSON DEFAULT '{}', -- {source, session_id, connection_id, ...}
processed BOOLEAN DEFAULT false,
created_at TIMESTAMPTZ DEFAULT now()
);
4.3 Vector Store Document
document_id: UUID5(content + metadata) -- deterministic, idempotent
embedding: float[1536] -- text-embedding-3-small
document: text -- original observation text
metadata: ContextMetadata -- see §3.D
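The deterministic `UUID5(content + metadata)` scheme can be sketched as below. Canonicalizing the metadata with sorted keys before hashing is an assumption about how the inputs are combined; the point is that re-indexing the same fact yields the same id, so an upsert is idempotent:

```python
import json
import uuid

def deterministic_document_id(content: str, metadata: dict) -> str:
    """Same (content, metadata) pair -> same document id, every time."""
    canonical = content + "|" + json.dumps(metadata, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_URL, canonical))
```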
4.4 Connection Memory Fields
-- On the Connection table:
persistent_learned_info JSON DEFAULT '{}' -- Per-tool learned preferences
5. Algorithms
5.1 Memory Consolidation (MemoryProcessor)
Input: current_scaffold, current_working_memory, new_observations[]
Model: Gemini 3 Flash (fast, cheap)
Output: {updated_scaffold, updated_working_memory, facts_to_embed[]}
Process:
- Fetch all `ContextEvent WHERE processed = false`, grouped by `user_id`
- For each user:
  - Serialize the current scaffold and working memory
  - Format new observations as a bullet list with `event_type` labels
  - Send a structured prompt to the LLM requesting JSON output
  - Parse the response, applying recursive JSON normalization
  - Write the scaffold and `working_memory` to the User row
  - Embed each `fact_to_embed` with `source=system_learned`
  - Mark events as `processed = true`
  - Commit the transaction
JSON Normalization: LLMs sometimes return nested data as stringified JSON (e.g., `{"Projects": "{\"toolstream\":\"A project\"}"}`). The `_normalize_json_values` function recursively unwraps these into proper nested dicts before storage.
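A sketch of that recursive unwrapping (the real `_normalize_json_values` may differ in details such as error handling): any string that parses as a JSON object or array is replaced by its parsed form, applied recursively, so stringified nesting becomes real nesting.

```python
import json

def normalize_json_values(value):
    """Recursively replace stringified JSON objects/arrays with parsed values."""
    if isinstance(value, dict):
        return {k: normalize_json_values(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_json_values(v) for v in value]
    if isinstance(value, str) and value.lstrip()[:1] in ("{", "["):
        try:
            return normalize_json_values(json.loads(value))
        except json.JSONDecodeError:
            return value  # looked like JSON but wasn't; keep the raw string
    return value
```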
5.2 Semantic Retrieval (ContextService)
Input: query_text, user_id, active_connection_ids[], n_results
Model: text-embedding-3-small (OpenAI)
Store: ChromaDB
Process:
- Generate an embedding for `query_text`
- Query ChromaDB with `query_embedding`, a WHERE filter (`user_id` AND `connection_id` scope), and `n_results` (default: 5)
- Deduplicate by `document_id`
- Sort by distance (ascending = most relevant)
- Return `[{content, metadata, distance}]`
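Steps 3-5 are pure post-processing and can be sketched directly. The tuple input shape and function name are assumptions; the dedupe-then-sort order and the `{content, metadata, distance}` return shape follow the steps above:

```python
def postprocess_results(raw: list[tuple]) -> list[dict]:
    """Deduplicate (doc_id, content, metadata, distance) tuples by doc_id,
    then sort ascending by distance (smaller = more similar)."""
    seen, unique = set(), []
    for doc_id, content, metadata, distance in raw:
        if doc_id in seen:
            continue
        seen.add(doc_id)
        unique.append({"content": content, "metadata": metadata,
                       "distance": distance})
    return sorted(unique, key=lambda r: r["distance"])
```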
5.3 Context Assembly (ToolExecutor)
Input: User object, active connections, user query text
Output: Fully assembled system prompt
Process:
- Format `working_memory` + scaffold + `persistent_info` into the context block
- Render the system prompt template with the context block and timestamp
- Inject agent identity (name, constitution) if configured
- Inject the artifact index if artifacts exist
- Inject TGL directives (caution level, tone, framing) if available
- Semantic retrieval: embed the user query → ChromaDB → top-5 results
- Append retrieved facts as the "Relevant Retrieved Facts" block
- Register `context_tool.record_observation` as an available LLM tool
5.4 Observation Routing
During interactive chat, when the LLM calls `context_tool.record_observation`:
- Create a `ContextEvent` with `event_type = category`, `content`, `source_metadata = {source: "interactive", session_id: ...}`, and `processed = false`
- Return success to the LLM and continue the conversation
- `MemoryProcessor` picks up the event asynchronously
During autonomous cycles:
- The LLM calls `context_tool__update_user_context` with `scaffold_updates` and/or `persistent_info_additions`
- For each key-value pair in `scaffold_updates`: create a `ContextEvent(event_type="scaffold_update", ...)`
- For each item in `persistent_info_additions`: create a `ContextEvent(event_type="learned_fact", ...)`
- Commit; `MemoryProcessor` synthesizes later
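The autonomous fan-out can be sketched as follows. The function name and the `"key: value"` content encoding are illustrative assumptions; what matters is that every scaffold key-value pair and every learned fact becomes its own unprocessed event row:

```python
def route_autonomous_update(user_id: str, scaffold_updates: dict,
                            persistent_info_additions: list) -> list[dict]:
    """Fan one autonomous tool call out into individual ContextEvent rows."""
    events = []
    for key, value in scaffold_updates.items():
        events.append({"user_id": user_id, "event_type": "scaffold_update",
                       "content": f"{key}: {value}", "processed": False})
    for item in persistent_info_additions:
        events.append({"user_id": user_id, "event_type": "learned_fact",
                       "content": item, "processed": False})
    return events  # the real implementation commits these in one transaction
```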
6. Interaction with Other Systems
6.1 Temporal Governance Layer (TGL)
The TGL consumes memory state as input to its State Estimator. The user's goals (from the Information Scaffold) and current context (from Working Memory) inform horizon weight calculations and chat directive shaping.
6.2 Inbox Ranking
The `InboxRanker.score_multiple` method uses the user's `autonomous_inbox_guidance` (a memory-adjacent field) to modulate LLM-based importance scoring. As the scaffold accumulates project and priority data, the autonomous system becomes better at scoring items.
6.3 Playbook Engine
Playbooks can reference user context through Jinja2 templates. Knowledge accumulated in memory indirectly improves playbook execution by providing richer system prompts during tool execution.
7. Implementation Plan
Phase 1: Foundation (Shipped)
- Three-tier storage: `working_memory`, `information_scaffold`, `persistent_learned_info` on the User table
- `ContextEvent` table and event bus pattern
- `MemoryProcessor` with LLM-mediated consolidation (Gemini 3 Flash)
- `ContextService` with ChromaDB and `text-embedding-3-small` embeddings
- `context_tool.record_observation` for interactive capture
- `context_tool.update_connection_context` for per-connection learning
- Context injection pipeline in `ToolExecutor._prepare_and_load_history`
- Semantic retrieval at query time
Phase 2: Enrichment (In Progress)
- Autonomous write path via `process_autonomous_cycle` feeding the `ContextEvent` bus
- Recursive JSON normalization for LLM-produced scaffold values
- `AgentContextLog` for a full audit trail of injected context per request
- Cost tracking for all embedding operations via the `system_embeddings` plugin
Phase 3: Advanced (Planned)
- Confidence scoring on scaffold entries (how certain is the agent about each fact?)
- Temporal decay on Working Memory entries (automatically age out stale context)
- Contradiction detection (new facts that conflict with existing scaffold entries)
- User-facing memory inspector (show the user what the agent has learned)
- Cross-session learning (synthesize patterns across multiple conversations)
- Memory-aware tool suggestion (recommend tools based on learned user patterns)
8. Evaluation & Experiments
8.1 Offline Metrics
- Scaffold accuracy: Agreement between agent-learned facts and ground truth user profiles
- Retrieval relevance: Precision@5 and NDCG@5 for semantic retrieval against labeled query-fact pairs
- Consolidation quality: LLM judge scoring of working memory narratives for completeness, conciseness, and recency
- Cold-start convergence: Number of interactions required to build a useful scaffold from empty state
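The retrieval-relevance metric above is straightforward to compute once query-fact pairs are labeled. A minimal sketch of Precision@5 (the divide-by-retrieved-count convention for short result lists is a choice, not something the spec fixes):

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int = 5) -> float:
    """Fraction of the top-k retrieved facts that are in the labeled
    relevant set for the query."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)
```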
8.2 Online Metrics
- Context utilization rate: Proportion of injected facts that appear in the agent's final response
- Observation rate: Average `context_tool.record_observation` calls per conversation
- Scaffold growth curve: Key count over time per user (healthy growth should be logarithmic, not linear)
- User correction rate: Frequency of correction events (should decrease over time as accuracy improves)
- Repeat question rate: How often the agent asks for information already in its memory (should be ~0)
8.3 Ablations
- No memory (baseline) vs. Working Memory only vs. full three-tier
- Fixed scaffold schema vs. schema-free ontology discovery
- Immediate write (synchronous) vs. event bus (asynchronous consolidation)
- With/without vector retrieval at query time
- Single consolidation model vs. tier-specific models
9. Open Research Questions
- Forgetting: When should the agent remove facts from memory? Human memory benefits from strategic forgetting; does agent memory?
- Privacy boundaries: How to handle sensitive information the user mentions in passing — should the agent remember everything, or apply sensitivity-aware retention policies?
- Schema convergence: Does the schema-free ontology eventually converge to a stable structure, or does it drift indefinitely? Is convergence even desirable?
- Cross-user knowledge: Can anonymized patterns from one user's scaffold improve cold-start for similar users (collaborative filtering for agent memory)?
- Memory conflicts: When autonomous and interactive paths produce contradictory observations, which should win? How to detect and resolve conflicts?
- Consolidation frequency: What is the optimal batch processing interval? Too frequent wastes compute; too infrequent makes the agent feel "forgetful"
- Working memory capacity: Is 500 words the right cap? Too short loses nuance; too long wastes tokens on every request. How to dynamically adjust based on user complexity?