Memory & Context - Vellum Docs

Your assistant remembers you. Not just within a single conversation, but across days, weeks, and months. Here's how.

Two layers of memory

Your assistant has two ways of remembering things, and they serve different purposes.

1. Workspace files — the baseline

These are plain text files in ~/.vellum/workspace/ that define the constants:

SOUL.md — behavioral rules and personality
IDENTITY.md — your assistant's name, nature, and vibe
USER.md — facts about you (name, location, preferences, projects)
NOW.md — working scratchpad for in-progress tasks and session context

Your assistant loads these into every conversation. They're the foundation — the context that makes it feel like it knows you before you've said a word. Your assistant also updates them as it learns new things about you, and you can edit them directly at any time.
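As a rough illustration of that loading step, here is a minimal Python sketch. The file names come from the list above; the function name `load_workspace` and the choice to skip missing files are assumptions made for the example, not a description of Vellum's actual loader.

```python
from pathlib import Path

# File names are taken from the list above; everything else is hypothetical.
WORKSPACE_FILES = ["SOUL.md", "IDENTITY.md", "USER.md", "NOW.md"]


def load_workspace(workspace_dir: Path) -> dict[str, str]:
    """Read each baseline file that exists and skip any that are missing."""
    loaded: dict[str, str] = {}
    for name in WORKSPACE_FILES:
        path = workspace_dir / name
        if path.is_file():
            loaded[name] = path.read_text(encoding="utf-8")
    return loaded
```

Calling `load_workspace(Path.home() / ".vellum" / "workspace")` would return whichever of the four files are present, keyed by file name.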

2. Long-term memory — the searchable history

Beyond workspace files, your assistant has a memory system that works more like human memory. It extracts facts from your conversations and stores them as searchable, categorized items — each with a natural lifetime:

Kind	Example	Lifetime
Identity	“Marina works at Vellum”	~6 months
Preference	“Hates morning meetings”	~3 months
Journal	“Shipped v0.5.14 today, felt great”	~3 months
Constraint	“Always use TypeScript for new skills”	~1 month
Project	“Working on docs rewrite”	~2 weeks
Decision	“Decided to go with option B”	~2 weeks
Event	“Dentist appointment March 15th”	~3 days

These lifetimes aren't hard cutoffs. Memories that come up across multiple conversations age more slowly — each additional conversation reinforces the memory by about 30%. Memories that go stale get demoted in search results before eventually dropping out.
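One way to picture the reinforcement effect is as a lifetime that stretches each time a memory resurfaces. The sketch below assumes the 30% boost compounds per additional conversation and rounds the table's lifetimes to whole days. Both choices are assumptions for illustration; the exact formula isn't documented here.

```python
# Approximate day counts for the lifetimes in the table above (rounded).
BASE_LIFETIME_DAYS = {
    "identity": 180,
    "preference": 90,
    "journal": 90,
    "constraint": 30,
    "project": 14,
    "decision": 14,
    "event": 3,
}

REINFORCEMENT = 0.30  # "about 30%" per additional conversation


def effective_lifetime_days(kind: str, conversations_seen: int) -> float:
    """Stretch the base lifetime once for every conversation after the first.

    Compounding is an assumption; a linear boost would also fit the text.
    """
    extra = max(conversations_seen - 1, 0)
    return BASE_LIFETIME_DAYS[kind] * (1 + REINFORCEMENT) ** extra
```

Under these assumptions, an event mentioned in three separate conversations would last roughly five days instead of three.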

The journal

Your assistant keeps a journal — a running narrative of what's happening in your life and work. Journal entries are stored as markdown files and indexed into the memory system automatically.

Important entries carry forward across conversations, while routine ones naturally fade. The journal gives your assistant a sense of continuity — not just isolated facts, but an evolving story of what you're working on, what happened recently, and what's coming up.

How it decides what to remember

Your assistant doesn't save everything. After each message, it runs an extraction step that identifies facts worth keeping — with confidence scores, importance ratings, and fingerprints to prevent duplicates.

It extracts when:

You share a personal fact or preference
You make a decision worth tracking
It learns something non-obvious from a task
You correct its behavior
Something seems important for future interactions

Low-value messages (“ok,” “thanks,” “got it”) are filtered out before extraction even runs. The extraction step is designed to err on the side of remembering too little rather than too much.
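The pre-filter for low-value messages could look something like the sketch below. The phrase list and the normalization rules are hypothetical; the real filter may use different criteria.

```python
import re

# Hypothetical list, seeded with the examples from the text.
LOW_VALUE = {"ok", "okay", "thanks", "thank you", "got it"}


def should_extract(message: str) -> bool:
    """Return False for empty or filler messages so extraction is skipped."""
    # Drop punctuation, collapse whitespace, and lowercase before comparing.
    normalized = re.sub(r"[^\w\s]", "", message).strip().lower()
    normalized = re.sub(r"\s+", " ", normalized)
    return bool(normalized) and normalized not in LOW_VALUE
```

A message like “Thanks!” is skipped, while “Remember that my dentist appointment is on March 15th.” goes on to extraction.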

If you want it to remember something specific, just say so:

“Remember that my dentist appointment is on March 15th.”

“Save this: the project deadline is end of Q2.”

When you explicitly ask it to remember something, it saves with high confidence — those memories are less likely to be superseded or go stale.

How it corrects itself

When the assistant extracts a new fact that contradicts an older one (say, you told it you preferred coffee last month but have since mentioned switching to tea), the new memory can supersede the old one. If the correction is explicit (“Actually, I prefer tea now”), the old memory is replaced immediately. If it's inferred, both coexist until the old one ages out.

Duplicate memories are caught by fingerprinting. If the same fact is extracted again, it reinforces the existing memory rather than creating a copy.
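To make the fingerprinting idea concrete, here is a small sketch that hashes a normalized version of each fact and reinforces the existing entry when the same fingerprint shows up again. The `Memory` and `MemoryStore` classes and the SHA-256 choice are assumptions for the example. This version only catches duplicates that match after lowercasing and whitespace cleanup, so a real system would likely need something more forgiving.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    reinforcements: int = 0


def fingerprint(text: str) -> str:
    """Hash a lowercased, whitespace-collapsed copy of the fact."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class MemoryStore:
    """Hypothetical in-memory store keyed by fingerprint."""

    def __init__(self) -> None:
        self.items: dict[str, Memory] = {}

    def add(self, text: str) -> Memory:
        key = fingerprint(text)
        existing = self.items.get(key)
        if existing is not None:
            # Same fact seen again: reinforce instead of storing a copy.
            existing.reinforcements += 1
            return existing
        memory = Memory(text=text)
        self.items[key] = memory
        return memory
```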

How context works in a conversation

Every time you send a message, your assistant assembles context from multiple sources:

Workspace files — SOUL.md, IDENTITY.md, USER.md, NOW.md, loaded at the start of the conversation
Conversation history — everything said so far in this session (summarized if it gets long)
Memory recall — a search of long-term memory for anything relevant to your message
Active skill instructions — if a skill is loaded, its instructions are included
Your message — what you just said

All of this gets sent to the AI model together. That's how your assistant responds with awareness of who you are, what you've discussed before, and what's relevant right now.
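Here is a simplified sketch of that assembly step, following the order of the list above. The function name, the section labels, and the plain-text layout are all hypothetical. The actual prompt format isn't specified in this document.

```python
from typing import Optional


def assemble_context(
    workspace_files: dict[str, str],
    history: list[str],
    recalled_memories: list[str],
    skill_instructions: Optional[str],
    user_message: str,
) -> str:
    """Concatenate the context sources in a fixed order, skipping empty ones."""
    sections: list[tuple[str, str]] = list(workspace_files.items())
    if history:
        sections.append(("Conversation history", "\n".join(history)))
    if recalled_memories:
        sections.append(("Recalled memories", "\n".join(recalled_memories)))
    if skill_instructions:
        sections.append(("Active skill", skill_instructions))
    # The new message always goes last.
    sections.append(("User message", user_message))
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)
```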

How memory recall works

When you send a message, the assistant doesn't just do a keyword search. It runs a hybrid retrieval pipeline:

Your message is embedded — converted into both a dense vector (capturing meaning) and a sparse vector (capturing keywords)
Both vectors search the memory store — dense search finds semantically similar memories, sparse search finds keyword matches. Results are merged using Reciprocal Rank Fusion.
Scoring — each result gets a composite score combining semantic relevance, recency (using a logarithmic decay curve so older memories aren't wiped out too fast), and extraction confidence
Tiering — high-scoring results get priority injection into the conversation; moderate results are included as “possibly relevant”; lower scores are dropped
Staleness check — memories past their natural lifetime get demoted, even if they scored well
Two-layer injection — relevant memories are formatted and inserted as structured context, split into an identity/preference layer (who you are) and a general context layer (everything else)
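The fusion, scoring, and tiering steps above can be sketched in a few lines. Reciprocal Rank Fusion is a standard technique: each item scores the sum of 1 / (k + rank) across the ranked lists it appears in, with k = 60 as a common default. The composite weights, the exact shape of the logarithmic decay, and the tier thresholds below are invented for illustration. The staleness check and two-layer injection are left out.

```python
import math


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge ranked ID lists; items ranked well in several lists rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


def composite_score(relevance: float, age_days: float, confidence: float) -> float:
    """Blend relevance, recency, and confidence (weights are hypothetical)."""
    # Logarithmic decay: recency falls quickly at first, then flattens out.
    recency = 1.0 / (1.0 + math.log1p(age_days))
    return 0.6 * relevance + 0.25 * recency + 0.15 * confidence


def tier(score: float) -> str:
    """Map a composite score to an injection tier (thresholds are hypothetical)."""
    if score >= 0.7:
        return "priority"
    if score >= 0.4:
        return "possibly_relevant"
    return "dropped"
```

For example, fusing a dense ranking `["a", "b", "c"]` with a sparse ranking `["b", "d", "a"]` puts `b` first, because it ranks near the top of both lists.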

The budget for memory injection is dynamic — it expands or contracts based on how much room is left in the context window after workspace files, conversation history, and skill instructions.
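A dynamic budget like this can be as simple as “whatever is left, up to a cap.” The sketch below assumes token counts as the unit, plus a reply reserve and a cap. All three numbers are made up for the example.

```python
def memory_budget(
    context_limit: int,
    workspace_tokens: int,
    history_tokens: int,
    skill_tokens: int,
    reserve_for_reply: int = 1024,  # hypothetical headroom for the model's answer
    cap: int = 4096,  # hypothetical upper bound on injected memory
) -> int:
    """Return how many tokens of memory can be injected, never below zero."""
    remaining = (
        context_limit
        - workspace_tokens
        - history_tokens
        - skill_tokens
        - reserve_for_reply
    )
    return max(0, min(remaining, cap))
```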

What happens when conversations get long

Every AI model has a context window — a limit on how much text it can process at once. Your assistant manages this automatically:

Compaction — when the conversation approaches 80% of the context limit, older messages are summarized into a compact form. The summary preserves goals, decisions, constraints, file paths, errors, and open questions while dropping filler and repetition.
If that's not enough — tool results are truncated to their essentials.
If still tight — images and file contents are replaced with text descriptions.
Last resort — memory injection is scaled back to recent items only.
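The escalation order above can be modeled as a ladder of reduction steps that stops as soon as the context fits. In this sketch the 80% threshold comes from the text, while the per-component token counts and the shrink ratios are placeholders. Real summarization and truncation would do actual work rather than divide a number.

```python
from typing import Callable

Usage = dict[str, int]  # token count per context component


def total(usage: Usage) -> int:
    return sum(usage.values())


# Placeholder reducers: each shrinks one component by a made-up ratio.
def summarize_history(u: Usage) -> Usage:
    return {**u, "history": u["history"] // 4}


def truncate_tool_results(u: Usage) -> Usage:
    return {**u, "tool_results": u["tool_results"] // 5}


def describe_attachments(u: Usage) -> Usage:
    return {**u, "attachments": u["attachments"] // 10}


def shrink_memory(u: Usage) -> Usage:
    return {**u, "memory": u["memory"] // 2}


LADDER: list[tuple[str, Callable[[Usage], Usage]]] = [
    ("compaction", summarize_history),
    ("truncate_tool_results", truncate_tool_results),
    ("describe_attachments", describe_attachments),
    ("shrink_memory", shrink_memory),
]


def fit_context(
    usage: Usage, context_limit: int, threshold: float = 0.80
) -> tuple[Usage, list[str]]:
    """Apply reduction steps in order until usage drops below the threshold."""
    applied: list[str] = []
    for name, step in LADDER:
        if total(usage) < threshold * context_limit:
            break
        usage = step(usage)
        applied.append(name)
    return usage, applied
```

Later steps only run when the earlier ones didn't free enough room, which matches the "if that's not enough" ordering in the list.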

You won't notice this happening. The assistant keeps the conversation going smoothly — it just works with a summarized version of the earlier context rather than the full transcript.

Private conversations

You can start a private conversation that gets its own isolated memory scope. The isolation runs in one direction:

Can't leak out: memories created in a private conversation won't surface in other conversations
Can read in: the private conversation can still access your shared memory pool

This is useful when you're discussing something sensitive. The assistant learns from the conversation, but those memories stay contained.
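The one-way isolation can be illustrated with a toy scoped store: every scope can read the shared pool, but memories written in a private scope are visible only from that scope. The class and scope names here are hypothetical.

```python
from dataclasses import dataclass, field

SHARED = "shared"  # hypothetical name for the default memory pool


@dataclass
class ScopedMemory:
    items: dict[str, list[str]] = field(default_factory=dict)

    def write(self, scope: str, text: str) -> None:
        """Store a memory in the scope of the conversation that produced it."""
        self.items.setdefault(scope, []).append(text)

    def read(self, scope: str) -> list[str]:
        """Return shared memories, plus this scope's own if it is private."""
        visible = list(self.items.get(SHARED, []))
        if scope != SHARED:
            visible += self.items.get(scope, [])
        return visible
```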

Trust and memory

Not everyone who talks to your assistant can shape its memories. Memory extraction runs only on messages from you, the guardian. Messages from anyone else, including trusted contacts and unknown parties, are indexed for search within that conversation, but they can't create or modify your long-term memories.

This prevents external parties from injecting false facts into your assistant's memory.
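In code, the trust gate amounts to one check before extraction. The sketch below uses a hypothetical `Sender` enum and two plain lists as stand-ins for the conversation index and the extraction queue. It also assumes the guardian's own messages are indexed for in-conversation search like everyone else's.

```python
from enum import Enum


class Sender(Enum):
    GUARDIAN = "guardian"
    TRUSTED_CONTACT = "trusted_contact"
    UNKNOWN = "unknown"


def route_message(
    sender: Sender,
    text: str,
    conversation_index: list[str],
    extraction_queue: list[str],
) -> None:
    """Index every message for in-conversation search; extract only from the guardian."""
    conversation_index.append(text)
    if sender is Sender.GUARDIAN:
        extraction_queue.append(text)
```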

Managing your memories

You have full control:

Ask what it knows: “What do you remember about me?” or “What do you know about Project Moonshot?”
Correct mistakes: “Actually, I prefer tea, not coffee.” (It'll supersede the old memory.)
Delete memories: “Forget what I told you about my dentist appointment.”
Search explicitly: “Search your memory for anything about the Q2 deadline.”
Edit files directly: Open USER.md or SOUL.md in any text editor and change whatever you want.

Privacy

Memories are stored locally on your machine — in a SQLite database and a Qdrant vector store inside ~/.vellum/workspace/data/. They don't get synced to a cloud, shared with other users, or used to train AI models.

However, memories are included in the context sent to the AI model when they're relevant to a conversation. This is how your assistant “thinks” with your context. Local storage, cloud thinking — the same trade-off as everything else in the system.

If you tell your assistant something sensitive, it may extract it as a memory and include it in future AI model calls when relevant. You can ask it to forget specific things, edit your workspace files directly, or use private conversations to keep sensitive context isolated.