A Vellum Assistant is an AI-powered agent that can perform tasks on your behalf. Each assistant is backed by a large language model and is composed of several configurable subsystems that control how it thinks, remembers, acts, and stays safe.
When you hatch an assistant, all of these subsystems are initialized with sensible defaults. You can customize any of them through the assistant's configuration file.
The assistant's behavior is driven by a configuration file that lives in your workspace. Core settings control which model powers the assistant and how it manages its context.
The LLM provider and model that powers the assistant. Managed assistants use Anthropic (Claude models) by default.
Self-hosted assistants can specify an alternative provider and model in the assistant's config file. Self-hosted users can also configure a providerOrder list to define automatic failover across providers when the primary is unavailable.
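The failover behavior can be sketched roughly as follows. This is an illustration only: the provider names, the `complete` callables, and the exception type are invented stand-ins, not the real Vellum API.

```python
# Minimal sketch of providerOrder failover: try each provider in order
# and return the first successful completion.

class ProviderUnavailable(Exception):
    pass

def complete_with_failover(prompt, providers):
    """providers is an ordered list of (name, complete_fn) pairs."""
    errors = {}
    for name, complete in providers:
        try:
            return name, complete(prompt)
        except ProviderUnavailable as exc:
            errors[name] = exc  # record the failure, fall through to the next
    raise RuntimeError(f"all providers failed: {list(errors)}")

def flaky(prompt):
    raise ProviderUnavailable("503")

def healthy(prompt):
    return f"echo: {prompt}"

used, reply = complete_with_failover("hi", [("primary", flaky), ("fallback", healthy)])
```

Here the primary provider raises, so the request transparently falls through to the fallback.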
Controls how the assistant manages conversation length. When the conversation grows beyond the configured token limits, older messages are automatically compacted into summaries while preserving the most recent turns.
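The compaction step might look like the sketch below: when the history exceeds the token limit, older messages are folded into a summary while the most recent turns survive verbatim. Word count stands in for a real tokenizer, and the summarizer is a placeholder string.

```python
# Hedged sketch of token-budget compaction (toy tokenizer: word count).

def count_tokens(msg):
    return len(msg.split())

def compact(history, max_tokens, keep_recent=2):
    total = sum(count_tokens(m) for m in history)
    if total <= max_tokens or len(history) <= keep_recent:
        return history  # under budget, or nothing old enough to compact
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = f"[summary of {len(old)} earlier messages]"  # LLM summary in practice
    return [summary] + recent

history = ["a b c d", "e f g h", "i j", "k l"]
compacted = compact(history, max_tokens=6, keep_recent=2)
```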
Extended thinking gives the assistant a dedicated token budget to reason through complex problems before responding. When enabled, the model uses a separate “thinking” step that is not shown in the final output.
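Conceptually, the flow resembles the sketch below: a hidden reasoning phase consumes its own budget, and only the final answer is surfaced. The two-phase structure and all names here are illustrative assumptions, not the actual model interface.

```python
# Illustrative only: reasoning happens in a hidden phase with its own
# budget; the visible reply is produced from (truncated) reasoning notes.

def respond(thinking_fn, answer_fn, thinking_budget):
    notes = thinking_fn()[:thinking_budget]  # hidden reasoning, capped by budget
    return answer_fn(notes)                  # only this value is shown to the user

out = respond(
    thinking_fn=lambda: ["step 1", "step 2", "step 3"],
    answer_fn=lambda notes: f"answer after {len(notes)} thinking steps",
    thinking_budget=2,
)
```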
The memory system gives the assistant long-term recall across conversations. It automatically extracts facts, preferences, and events from messages and stores them in a searchable index. On each new turn, relevant memories are retrieved and injected into the assistant's context.
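The extract-store-retrieve loop can be sketched as below. The extractor and relevance scorer are deliberately naive stand-ins for the real extraction model and index.

```python
# Toy memory pipeline: extract facts from messages, store them, and
# retrieve the most relevant ones for a new turn.

class MemoryStore:
    def __init__(self):
        self.items = []

    def extract_and_store(self, message):
        # Toy extractor: any sentence mentioning "prefer" counts as a preference.
        for sentence in message.split("."):
            if "prefer" in sentence:
                self.items.append(sentence.strip())

    def retrieve(self, query, k=3):
        # Toy relevance: count words shared between the query and each memory.
        def score(item):
            return len(set(item.lower().split()) & set(query.lower().split()))
        return sorted(self.items, key=score, reverse=True)[:k]

store = MemoryStore()
store.extract_and_store("I prefer dark mode. The weather is nice.")
memories = store.retrieve("which mode does the user prefer?")
```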
Memory items are converted into vector embeddings for semantic search. The embedding provider can be configured to use local models (for privacy), or cloud providers like OpenAI or Gemini (for quality). Embeddings are stored in a Qdrant vector database that runs alongside the assistant.
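To make the embedding lookup concrete, here is a minimal illustration. A real deployment uses a learned embedding model and Qdrant; a toy bag-of-words vector and cosine similarity stand in for both.

```python
# Semantic lookup sketch: embed memories and a query, pick the nearest.

import math

def embed(text, vocab):
    words = text.lower().split()
    return [words.count(w) for w in vocab]  # bag-of-words stand-in for a model

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

vocab = ["coffee", "tea", "meeting", "friday"]
memories = ["user likes coffee", "meeting moved to friday"]
vectors = [embed(m, vocab) for m in memories]

query = embed("schedule the friday meeting", vocab)
best = max(range(len(memories)), key=lambda i: cosine(query, vectors[i]))
```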
Retrieval combines lexical (keyword) and semantic (vector) search to find the most relevant memories. Results are optionally re-ranked by a lightweight LLM to ensure precision. The system supports dynamic budget allocation — injecting fewer memories when the conversation is already long, and more when there is headroom.
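The hybrid scoring and dynamic budget described above can be sketched as follows; the blend weight, the budget formula, and the example scores are all illustrative assumptions.

```python
# Sketch: blend lexical and semantic scores, then shrink the number of
# injected memories as the context window fills up.

def hybrid_score(lexical, semantic, alpha=0.5):
    return alpha * lexical + (1 - alpha) * semantic

def memory_budget(context_tokens, context_limit, max_memories=8):
    headroom = max(0.0, 1 - context_tokens / context_limit)
    return max(1, round(max_memories * headroom))  # fewer memories when full

scored = [
    ("likes coffee", hybrid_score(0.2, 0.9)),
    ("meeting friday", hybrid_score(0.8, 0.4)),
    ("lives in Oslo", hybrid_score(0.1, 0.1)),
]
scored.sort(key=lambda kv: kv[1], reverse=True)

budget = memory_budget(context_tokens=6000, context_limit=8000)
selected = [m for m, _ in scored[:budget]]
```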
The assistant can extract entities (people, places, organizations) and their relationships from conversations, building a knowledge graph over time. During retrieval, the graph is traversed to surface related context that keyword or semantic search alone might miss.
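A toy version of that traversal is sketched below: entities found in the query seed a hop-limited walk over the graph, surfacing facts that neither keyword nor vector search would rank highly. The graph contents are invented.

```python
# Hop-limited knowledge-graph expansion from seed entities.

from collections import deque

graph = {
    "Alice": [("works_at", "Acme")],
    "Acme": [("located_in", "Berlin")],
    "Berlin": [],
}

def related(entities, hops=2):
    """Collect (subject, relation, object) triples within `hops` of the seeds."""
    seen, triples = set(entities), []
    queue = deque((e, 0) for e in entities)
    while queue:
        node, depth = queue.popleft()
        if depth >= hops:
            continue
        for rel, obj in graph.get(node, []):
            triples.append((node, rel, obj))
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, depth + 1))
    return triples

facts = related(["Alice"])
```

Starting from "Alice", the second hop surfaces the Berlin fact even though "Berlin" never appears in the query.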
Skills are modular capabilities that extend what the assistant can do. Each skill bundles a set of tools, instructions, and optional configuration into a self-contained unit. See the Skills documentation for a deep dive on how skills work.
When the assistant needs to execute code, it can use a sandboxed environment. The default sandbox backend is Docker, which runs the code in an isolated container with configurable CPU, memory, and network limits, so that execution cannot affect the host system.
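As a rough sketch, the configured limits map onto standard `docker run` flags (`--cpus`, `--memory`, and `--network` are real Docker CLI flags); the wrapper function and defaults are hypothetical. The command is only constructed here, not executed.

```python
# Sketch: translate sandbox limits into a `docker run` invocation.

def sandbox_command(image, code_path, cpus=1.0, memory="512m", network="none"):
    return [
        "docker", "run", "--rm",
        f"--cpus={cpus}",        # CPU quota for the container
        f"--memory={memory}",    # hard memory cap
        f"--network={network}",  # "none" disables networking entirely
        "-v", f"{code_path}:/work/script.py:ro",
        image, "python", "/work/script.py",
    ]

cmd = sandbox_command("python:3.12-slim", "/tmp/snippet.py")
```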
For complex tasks that benefit from parallelism, the assistant can spawn a swarm of worker agents. A planner decomposes the task into subtasks, each handled by an independent worker. Results are synthesized back into a unified response. Swarm execution is configurable with limits on the number of concurrent workers and per-task timeouts.
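The plan, fan-out, synthesize cycle can be sketched with the standard library; the planner here just splits the task into fixed subtasks, and the worker cap and per-task timeout mirror the configurable limits mentioned above.

```python
# Swarm sketch: plan -> run workers concurrently (capped) -> synthesize.

from concurrent.futures import ThreadPoolExecutor

def plan(task):
    return [f"{task}: part {i}" for i in range(3)]  # toy decomposition

def worker(subtask):
    return subtask.upper()  # stand-in for an independent worker agent

def run_swarm(task, max_workers=2, task_timeout=5.0):
    subtasks = plan(task)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, s) for s in subtasks]
        results = [f.result(timeout=task_timeout) for f in futures]
    return " | ".join(results)  # synthesis step

summary = run_swarm("analyze report")
```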
Multiple safety layers protect both you and the assistant's runtime:
Automatically scans messages for API keys, passwords, and other sensitive data. Detected secrets can be redacted or blocked before they are stored or sent to the model.
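A toy version of the scanner is sketched below. Real scanners use many more patterns plus entropy heuristics; the two regexes here only approximate common key shapes and are illustrative.

```python
# Toy secret scanner: match common API-key shapes and redact them
# before the message is stored or sent to the model.

import re

PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),  # OpenAI-style secret key shape
    re.compile(r"AKIA[0-9A-Z]{16}"),     # AWS access key ID shape
]

def redact(text):
    found = False
    for pattern in PATTERNS:
        text, n = pattern.subn("[REDACTED]", text)
        found = found or n > 0
    return text, found

clean, had_secret = redact("my key is sk-abcdefghijklmnopqrstuv ok?")
```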
Controls which tools and actions the assistant is allowed to invoke. The permission system supports workspace-level policies so that high-risk actions require explicit guardian approval.
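A workspace-level policy check might look like the following sketch; the policy keys, the three-way allow/ask/deny decision values, and the guardian callback are assumptions for illustration.

```python
# Permission sketch: tools map to "allow", "deny", or "ask"; "ask"
# escalates to the guardian for explicit approval.

POLICY = {"read_file": "allow", "delete_file": "ask", "shell": "deny"}

def authorize(tool, guardian_approves):
    decision = POLICY.get(tool, "deny")  # default-deny for unknown tools
    if decision == "allow":
        return True
    if decision == "ask":
        return guardian_approves(tool)   # high-risk action needs approval
    return False

ok = authorize("delete_file", guardian_approves=lambda t: True)
blocked = authorize("shell", guardian_approves=lambda t: True)
```

Default-denying unknown tools keeps newly added capabilities safe until a policy entry explicitly allows them.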
Configurable limits on requests per minute and tokens per session prevent runaway usage and keep costs predictable.
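Requests-per-minute limiting is commonly implemented as a token bucket; the sketch below shows that pattern with illustrative parameters (the real limiter's internals are not documented here).

```python
# Token-bucket sketch: the bucket refills at rate_per_min and each
# request spends one token; requests are rejected when the bucket is empty.

class TokenBucket:
    def __init__(self, rate_per_min, capacity):
        self.rate = rate_per_min / 60.0  # refill per second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_min=60, capacity=2)
decisions = [bucket.allow(t) for t in (0.0, 0.1, 0.2, 2.0)]
```

The third request arrives before the bucket has refilled and is rejected; by the fourth, enough time has passed to allow it again.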