Purpose
Implement a multi-tier memory system that enables a production AI agent to build, maintain, and recall a persistent understanding of its user across all interaction surfaces — interactive chat, autonomous background processing, and inbox analysis.
The architecture must support:
- Continuous learning from all interaction channels without explicit user instruction
- LLM-mediated memory consolidation that synthesizes raw observations into structured knowledge
- Semantic deep recall of facts, preferences, and context at query time
- Schema-free user modeling where the agent organically discovers and evolves its own ontology per user
The system is inspired by human memory consolidation: observations are captured in real-time, buffered as events, and periodically synthesized by a background process — analogous to how short-term memories consolidate into long-term storage during sleep.
1. Core Concepts
1.1 Memory Tiers
Define three tiers M = {W, S, V}, each serving a distinct cognitive purpose:
| Tier | Name | Storage | Purpose | Analogy |
|---|---|---|---|---|
| W | Working Memory | Text column | Continuously LLM-synthesized narrative of current context, projects, and priorities. Capped at ~500 words. | Prefrontal cortex — "what am I thinking about right now" |
| S | Information Scaffold | JSON column | Dynamic, schema-free structured profile. Keys are created organically (e.g., Role, Goals, Projects, Preferences). The agent invents its own ontology per user. | Declarative memory — "what I know about this person" |
| V | Vector Long-Term Memory | ChromaDB collection | Semantic embedding store with rich metadata filtering for deep recall of facts, attachment content, and tool outputs. | Episodic memory — "something I learned once and can retrieve if relevant" |
1.2 Event Bus Model
All writes to memory — from both interactive and autonomous paths — flow through a unified event bus:
Observation → ContextEvent (DB row, processed=false) → MemoryProcessor → {W, S, V}
This ensures:
- Consistent memory quality regardless of source
- Decoupled write and synthesis paths
- Audit trail of every observation
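The event-bus contract above can be sketched in a few lines. The `ContextEvent` fields mirror §4.2; the in-memory `EVENT_LOG` list and the `record` helper are illustrative stand-ins for the database layer, not the actual implementation:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative stand-in for the context_events table (see the data model).
EVENT_LOG: list["ContextEvent"] = []

@dataclass
class ContextEvent:
    user_id: str
    event_type: str            # e.g. "fact", "preference", "scaffold_update"
    content: str               # raw observation text
    source_metadata: dict = field(default_factory=dict)
    processed: bool = False    # MemoryProcessor flips this after synthesis
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def record(user_id: str, event_type: str, content: str, **meta) -> ContextEvent:
    """Both write paths reduce to this shape: append-only, processed=False."""
    event = ContextEvent(user_id, event_type, content, source_metadata=meta)
    EVENT_LOG.append(event)
    return event
```

Because every writer goes through the same append, the consolidation path never needs to know whether an observation came from chat or from an autonomous cycle.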
1.3 Write Paths
Two distinct write paths feed the same event bus:
- Interactive Path: The `context_tool.record_observation` tool, invoked by the LLM during chat when it detects new user information (facts, preferences, project details, corrections).
- Autonomous Path: The `process_autonomous_cycle` background worker, which creates `ContextEvent` records of types `scaffold_update` and `learned_fact` when processing inbox items.
1.4 Read Paths
Memory is consumed in two ways:
- Injection: At the start of every chat turn, Working Memory and Information Scaffold are injected directly into the system prompt as context. This ensures every conversation is grounded in accumulated knowledge.
- Retrieval: The `ContextService` performs semantic search against the Vector store at query time, returning the top-N most relevant long-term facts. These are appended to the system prompt as "Relevant Retrieved Facts."
2. Requirements
2.1 Functional Requirements
- Multi-channel observation capture: Record observations from interactive chat (`context_tool.record_observation`), autonomous cycles (`ContextEvent` with `scaffold_update`/`learned_fact`), and tool outputs (e.g., email attachment content indexed into the vector store)
- Asynchronous LLM-mediated consolidation: A background `MemoryProcessor` synthesizes pending observations into all three memory tiers using structured JSON output from the LLM
- Semantic retrieval with metadata filtering: Query-time retrieval must filter by `user_id`, `connection_id`, and `context_type`, returning only facts relevant to the active user, their active connections, and the query semantics
- Schema-free ontology discovery: The Information Scaffold must support arbitrary, evolving key hierarchies without a predefined schema. The LLM decides what categories to create
- Connection-scoped learned information: Each tool connection maintains its own `persistent_learned_info` JSON column, enabling per-tool preference storage (e.g., "default Jira project for this connection is PROJ")
- Graceful degradation: The system must function when any tier is empty (new-user cold start) or unavailable (ChromaDB down)
2.2 Non-Functional Requirements
- Latency: Memory injection ≤ 50 ms (JSON serialization of `working_memory` + `information_scaffold`). Vector retrieval ≤ 300 ms per query (amortized; cached embeddings are acceptable)
- Cost control: All embedding operations are deducted from the user's credit balance via the `system_embeddings` plugin configuration. The `_deduct_cost` method tracks per-operation cost against the user balance
- Idempotency: `ContextEvent` records are marked `processed=true` after successful synthesis. Failed synthesis leaves events unprocessed for retry
- Auditability: Every observation includes `source_metadata` (source, session_id, connection_id). Every synthesis run produces `AgentContextLog` entries
3. System Architecture
A) Observation Capture Layer
Two entrypoints feed the event bus:
Interactive (`context_tool.record_observation`):
User says something → LLM detects fact → calls record_observation(content, category)
→ ContextEvent(event_type=category, content=content, processed=false) saved to DB
Categories: fact, preference, project_details, correction
Autonomous (`process_autonomous_cycle`):
Inbox items arrive → Autonomous LLM analyzes → calls context_tool__update_user_context
→ ContextEvent(event_type="scaffold_update" | "learned_fact", content=...) saved to DB
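The interactive entrypoint can be sketched as follows. The category set matches the list above; `PENDING_EVENTS` is an illustrative stand-in for the DB insert, and the return shape is an assumption rather than the actual tool contract:

```python
# Stand-in for the context_events table.
PENDING_EVENTS: list[dict] = []

VALID_CATEGORIES = {"fact", "preference", "project_details", "correction"}

def record_observation(content: str, category: str, session_id: str) -> dict:
    """Validate the category chosen by the LLM, then enqueue an
    unprocessed ContextEvent for the MemoryProcessor to pick up later."""
    if category not in VALID_CATEGORIES:
        return {"ok": False, "error": f"unknown category: {category}"}
    PENDING_EVENTS.append({
        "event_type": category,
        "content": content,
        "source_metadata": {"source": "interactive", "session_id": session_id},
        "processed": False,
    })
    return {"ok": True}
```

Note that the tool returns immediately: synthesis cost is never paid on the chat critical path.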
B) Memory Processor (Consolidation Engine)
A background job (`memory_runner.py`) periodically:
- Fetches all `ContextEvent` where `processed = false`
- Groups by `user_id`
- For each user, invokes an LLM with:
  - Current Working Memory state
  - Current Information Scaffold state
  - New observations since last run
- Receives structured JSON output:

      {
        "updated_scaffold": { "Role": "Engineering Manager", "Projects": {...} },
        "updated_working_memory": "User is currently focused on...",
        "facts_to_embed": ["User prefers short emails", "Default Jira project is PROJ"]
      }

- Applies updates atomically:
  - `User.information_scaffold` ← merged structured profile (with recursive JSON normalization for LLM-produced stringified values)
  - `User.working_memory` ← rewritten narrative
  - `facts_to_embed[]` → each fact embedded via `text-embedding-3-small` and stored in ChromaDB with full metadata
- Marks all processed events as `processed = true`
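A minimal sketch of one consolidation pass, assuming `llm` returns the structured JSON shape described above. The function and parameter names (`run_consolidation`, `embed_and_store`) are illustrative, and the in-memory dicts stand in for the User rows:

```python
import json
from collections import defaultdict

def run_consolidation(events, users, llm, embed_and_store):
    """One MemoryProcessor pass over unprocessed events, grouped per user."""
    pending = [e for e in events if not e["processed"]]
    by_user = defaultdict(list)
    for e in pending:
        by_user[e["user_id"]].append(e)

    for user_id, user_events in by_user.items():
        user = users[user_id]
        prompt = {
            "current_scaffold": user["information_scaffold"],
            "current_working_memory": user["working_memory"],
            "new_observations": [f'[{e["event_type"]}] {e["content"]}'
                                 for e in user_events],
        }
        result = llm(json.dumps(prompt))  # expected keys per the schema above
        user["information_scaffold"] = result["updated_scaffold"]
        user["working_memory"] = result["updated_working_memory"]
        for fact in result.get("facts_to_embed", []):
            embed_and_store(user_id, fact)
        # Only mark events after a successful run; a failed run leaves
        # them unprocessed for retry (idempotency requirement, §2.2).
        for e in user_events:
            e["processed"] = True
```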
C) Context Injection Pipeline
At every chat turn (`_prepare_and_load_history`), the system assembles the agent's context:
1. Working Memory (narrative text) → System prompt "# Your Context" block
2. Information Scaffold (serialized JSON) → "Structured Data:" sub-block
3. Persistent Learned Info (serialized JSON) → "Learned Info:" sub-block
4. Vector Retrieval (semantic search) → "Relevant Retrieved Facts:" block
5. Agent Identity (name, constitution) → "# Agent Identity" block
6. TGL Directives → "# Temporal Governance Directives" block
7. Artifact Index → "Your Artifacts" block
Steps 1–3 are deterministic (fast, always available). Step 4 is semantic (requires ChromaDB + embedding API).
D) Vector Store (ContextService)
Backed by ChromaDB with a single collection (`toolstream_context`).
Metadata Schema:
context_type ∈ {user_profile, user_persistent, tool_schema, tool_entity,
tool_persistent, tool_attachment_content}
user_id (required, always filtered)
org_id (optional, for future multi-tenancy)
tool_id (optional, scopes facts to a specific tool)
connection_id (optional, scopes facts to a specific connection instance)
source ∈ {user_provided, system_learned, tool_indexed, tool_extracted_content}
message_id (optional, links to originating message)
attachment_id (optional, links to originating attachment)
filename (optional, for attachment content)
Retrieval Filter Logic:
WHERE user_id = current_user
AND (connection_id IN active_connections OR connection_id IS NULL)
This ensures user-level facts are always returned while tool-specific facts are scoped to active connections.
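The scoping rule above can be expressed in ChromaDB's `where` syntax. One caveat: Chroma cannot match on a *missing* metadata key, so this sketch assumes user-level facts store an empty-string sentinel for `connection_id` — that sentinel is an implementation choice, not something the spec defines:

```python
def build_where_filter(user_id: str, active_connection_ids: list[str]) -> dict:
    """Build the retrieval filter: user-level facts always pass, tool-level
    facts pass only for currently active connections."""
    return {
        "$and": [
            {"user_id": user_id},
            {"$or": [
                {"connection_id": {"$in": active_connection_ids}},
                # Sentinel standing in for "connection_id IS NULL".
                {"connection_id": ""},
            ]},
        ]
    }
```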
E) Connection-Scoped Memory
Beyond user-level memory, each Connection object maintains:
- `persistent_learned_info` (JSON): Tool-specific preferences (e.g., "default_project: PROJ", "preferred_issue_type: Task")
- `ConnectionContext` (table): Cached tool schema/metadata with TTL-based refresh

Both are updated via `context_tool.update_connection_context`, which allows the agent to store per-connection learned information during chat.
4. Data Model
4.1 User Memory Fields
-- On the User table:
working_memory TEXT -- AI-synthesized summary of user context (~500 words)
information_scaffold JSON -- Schema-free structured profile (dynamic keys)
persistent_learned_info JSON -- Adaptive, unstructured learned info
4.2 ContextEvent (Event Bus)
CREATE TABLE context_events (
id UUID PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users(id),
organization_id UUID NOT NULL REFERENCES organizations(id),
event_type VARCHAR(50) NOT NULL, -- scaffold_update, learned_fact, fact, preference, etc.
content TEXT NOT NULL, -- Raw observation text
source_metadata JSON DEFAULT '{}', -- {source, session_id, connection_id, ...}
processed BOOLEAN DEFAULT false,
created_at TIMESTAMPTZ DEFAULT now()
);
4.3 Vector Store Document
document_id: UUID5(content + metadata) -- deterministic, idempotent
embedding: float[1536] -- text-embedding-3-small
document: text -- original observation text
metadata: ContextMetadata -- see §3.D
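The deterministic `UUID5(content + metadata)` scheme can be sketched as below. Canonicalizing the metadata with sorted keys before hashing is an assumption about how the inputs are combined; the point is that re-indexing the same fact yields the same id, so an upsert is idempotent:

```python
import json
import uuid

def deterministic_document_id(content: str, metadata: dict) -> str:
    """Same (content, metadata) pair -> same document id, every time."""
    canonical = content + "|" + json.dumps(metadata, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_URL, canonical))
```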
4.4 Connection Memory Fields
-- On the Connection table:
persistent_learned_info JSON DEFAULT '{}' -- Per-tool learned preferences
5. Algorithms
5.1 Memory Consolidation (MemoryProcessor)
Input: current_scaffold, current_working_memory, new_observations[]
Model: Gemini 3 Flash (fast, cheap)
Output: {updated_scaffold, updated_working_memory, facts_to_embed[]}
Process:
- Fetch all `ContextEvent WHERE processed = false`, grouped by `user_id`
- For each user:
  - Serialize the current scaffold and working memory
  - Format new observations as a bullet list with `event_type` labels
  - Send a structured prompt to the LLM requesting JSON output
  - Parse the response, applying recursive JSON normalization
  - Write the scaffold and `working_memory` to the User row
  - Embed each `fact_to_embed` with `source=system_learned`
  - Mark events as `processed = true`
  - Commit the transaction
JSON Normalization: LLMs sometimes return nested data as stringified JSON (e.g., `{"Projects": "{\"toolstream\":\"A project\"}"}`). The `_normalize_json_values` function recursively unwraps these into proper nested dicts before storage.
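A sketch of that recursive unwrapping (the real `_normalize_json_values` may differ in details such as error handling): any string that parses as a JSON object or array is replaced by its parsed form, applied recursively, so stringified nesting becomes real nesting.

```python
import json

def normalize_json_values(value):
    """Recursively replace stringified JSON objects/arrays with parsed values."""
    if isinstance(value, dict):
        return {k: normalize_json_values(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_json_values(v) for v in value]
    if isinstance(value, str) and value.lstrip()[:1] in ("{", "["):
        try:
            return normalize_json_values(json.loads(value))
        except json.JSONDecodeError:
            return value  # looked like JSON but wasn't; keep the raw string
    return value
```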
5.2 Semantic Retrieval (ContextService)
Input: query_text, user_id, active_connection_ids[], n_results
Model: text-embedding-3-small (OpenAI)
Store: ChromaDB
Process:
- Generate an embedding for `query_text`
- Query ChromaDB with `query_embedding`, a WHERE filter (`user_id` AND `connection_id` scope), and `n_results` (default: 5)
- Deduplicate by `document_id`
- Sort by distance (ascending = most relevant)
- Return `[{content, metadata, distance}]`
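Steps 3-5 are pure post-processing and can be sketched directly. The tuple input shape and function name are assumptions; the dedupe-then-sort order and the `{content, metadata, distance}` return shape follow the steps above:

```python
def postprocess_results(raw: list[tuple]) -> list[dict]:
    """Deduplicate (doc_id, content, metadata, distance) tuples by doc_id,
    then sort ascending by distance (smaller = more similar)."""
    seen, unique = set(), []
    for doc_id, content, metadata, distance in raw:
        if doc_id in seen:
            continue
        seen.add(doc_id)
        unique.append({"content": content, "metadata": metadata,
                       "distance": distance})
    return sorted(unique, key=lambda r: r["distance"])
```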
5.3 Context Assembly (ToolExecutor)
Input: User object, active connections, user query text
Output: Fully assembled system prompt
Process:
- Format `working_memory` + scaffold + `persistent_info` into the context block
- Render the system prompt template with the context block and timestamp
- Inject agent identity (name, constitution) if configured
- Inject the artifact index if artifacts exist
- Inject TGL directives (caution level, tone, framing) if available
- Semantic retrieval: embed the user query → ChromaDB → top-5 results
- Append retrieved facts as the "Relevant Retrieved Facts" block
- Register `context_tool.record_observation` as an available LLM tool
5.4 Observation Routing
During interactive chat, when the LLM calls `context_tool.record_observation`:
- Create a `ContextEvent` with `event_type = category`, `content`, `source_metadata = {source: "interactive", session_id: ...}`, and `processed = false`
- Return success to the LLM and continue the conversation
- `MemoryProcessor` picks up the event asynchronously
During autonomous cycles:
- The LLM calls `context_tool__update_user_context` with `scaffold_updates` and/or `persistent_info_additions`
- For each key-value pair in `scaffold_updates`: create a `ContextEvent(event_type="scaffold_update", ...)`
- For each item in `persistent_info_additions`: create a `ContextEvent(event_type="learned_fact", ...)`
- Commit; `MemoryProcessor` synthesizes later
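The autonomous fan-out can be sketched as follows. The function name and the `"key: value"` content encoding are illustrative assumptions; what matters is that every scaffold key-value pair and every learned fact becomes its own unprocessed event row:

```python
def route_autonomous_update(user_id: str, scaffold_updates: dict,
                            persistent_info_additions: list) -> list[dict]:
    """Fan one autonomous tool call out into individual ContextEvent rows."""
    events = []
    for key, value in scaffold_updates.items():
        events.append({"user_id": user_id, "event_type": "scaffold_update",
                       "content": f"{key}: {value}", "processed": False})
    for item in persistent_info_additions:
        events.append({"user_id": user_id, "event_type": "learned_fact",
                       "content": item, "processed": False})
    return events  # the real implementation commits these in one transaction
```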
6. Interaction with Other Systems
6.1 Temporal Governance Layer (TGL)
The TGL consumes memory state as input to its State Estimator. The user's goals (from the Information Scaffold) and current context (from Working Memory) inform horizon weight calculations and chat directive shaping.
6.2 Inbox Ranking
The `InboxRanker.score_multiple` method uses the user's `autonomous_inbox_guidance` (a memory-adjacent field) to modulate LLM-based importance scoring. As the scaffold accumulates project and priority data, the autonomous system becomes better at scoring items.
6.3 Playbook Engine
Playbooks can reference user context through Jinja2 templates. Knowledge accumulated in memory indirectly improves playbook execution by providing richer system prompts during tool execution.
7. Implementation Plan
Phase 1: Foundation (Shipped)
- Three-tier storage: `working_memory`, `information_scaffold`, `persistent_learned_info` on the User table
- `ContextEvent` table and event bus pattern
- `MemoryProcessor` with LLM-mediated consolidation (Gemini 3 Flash)
- `ContextService` with ChromaDB and `text-embedding-3-small` embeddings
- `context_tool.record_observation` for interactive capture
- `context_tool.update_connection_context` for per-connection learning
- Context injection pipeline in `ToolExecutor._prepare_and_load_history`
- Semantic retrieval at query time
Phase 2: Enrichment (In Progress)
- Autonomous write path via `process_autonomous_cycle` feeding the `ContextEvent` bus
- Recursive JSON normalization for LLM-produced scaffold values
- `AgentContextLog` for a full audit trail of injected context per request
- Cost tracking for all embedding operations via the `system_embeddings` plugin
Phase 3: Advanced (Planned)
- Confidence scoring on scaffold entries (how certain is the agent about each fact?)
- Temporal decay on Working Memory entries (automatically age out stale context)
- Contradiction detection (new facts that conflict with existing scaffold entries)
- User-facing memory inspector (show the user what the agent has learned)
- Cross-session learning (synthesize patterns across multiple conversations)
- Memory-aware tool suggestion (recommend tools based on learned user patterns)
8. Evaluation & Experiments
8.1 Offline Metrics
- Scaffold accuracy: Agreement between agent-learned facts and ground truth user profiles
- Retrieval relevance: Precision@5 and NDCG@5 for semantic retrieval against labeled query-fact pairs
- Consolidation quality: LLM judge scoring of working memory narratives for completeness, conciseness, and recency
- Cold-start convergence: Number of interactions required to build a useful scaffold from empty state
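The retrieval-relevance metric above is straightforward to compute once query-fact pairs are labeled. A minimal sketch of Precision@5 (the divide-by-retrieved-count convention for short result lists is a choice, not something the spec fixes):

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int = 5) -> float:
    """Fraction of the top-k retrieved facts that are in the labeled
    relevant set for the query."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)
```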
8.2 Online Metrics
- Context utilization rate: Proportion of injected facts that appear in the agent's final response
- Observation rate: Average `context_tool.record_observation` calls per conversation
- Scaffold growth curve: Key count over time per user (healthy growth should be logarithmic, not linear)
- User correction rate: Frequency of correction events (should decrease over time as accuracy improves)
- Repeat question rate: How often the agent asks for information already in its memory (should be ~0)
8.3 Ablations
- No memory (baseline) vs. Working Memory only vs. full three-tier
- Fixed scaffold schema vs. schema-free ontology discovery
- Immediate write (synchronous) vs. event bus (asynchronous consolidation)
- With/without vector retrieval at query time
- Single consolidation model vs. tier-specific models
9. Open Research Questions
- Forgetting: When should the agent remove facts from memory? Human memory benefits from strategic forgetting; does agent memory?
- Privacy boundaries: How to handle sensitive information the user mentions in passing — should the agent remember everything, or apply sensitivity-aware retention policies?
- Schema convergence: Does the schema-free ontology eventually converge to a stable structure, or does it drift indefinitely? Is convergence even desirable?
- Cross-user knowledge: Can anonymized patterns from one user's scaffold improve cold-start for similar users (collaborative filtering for agent memory)?
- Memory conflicts: When autonomous and interactive paths produce contradictory observations, which should win? How to detect and resolve conflicts?
- Consolidation frequency: What is the optimal batch processing interval? Too frequent wastes compute; too infrequent makes the agent feel "forgetful"
- Working memory capacity: Is 500 words the right cap? Too short loses nuance; too long wastes tokens on every request. How to dynamically adjust based on user complexity?