Assistant

Overview

A Vellum Assistant is an AI-powered agent that can perform tasks on your behalf. Each assistant is backed by a large language model and is composed of several configurable subsystems that control how it thinks, remembers, acts, and stays safe.

When you hatch an assistant, all of these subsystems are initialized with sensible defaults. You can customize any of them through the assistant's configuration file.

Configuration

The assistant's behavior is driven by a configuration file that lives in your workspace. Core settings control which model powers the assistant and how it manages its context.
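The docs don't pin down the file's format or schema, so the JSON below is only a sketch of the overall shape such a file might take. `maxInputTokens` is a real setting described later on this page; the other key names are illustrative:

```json
{
  "provider": "anthropic",
  "model": "claude-sonnet",
  "contextWindow": {
    "maxInputTokens": 128000
  }
}
```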

Provider

The LLM provider and model that powers the assistant. Managed assistants use Anthropic (Claude models) by default.

Self-hosted assistants can configure alternative providers via the assistant's config file:

  • Anthropic — Claude models (default)
  • OpenAI — GPT models
  • Google — Gemini models
  • Ollama — Local open-source models
  • OpenRouter — Access multiple providers through a single API
  • Fireworks — Optimized inference for open models

Self-hosted users can also configure a providerOrder to define automatic failover across providers if the primary is unavailable.
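As a sketch, a self-hosted config might select a primary provider and declare a failover order. Only `providerOrder` is named above; the `provider` and `model` keys and the model name are assumptions for illustration:

```json
{
  "provider": "openai",
  "model": "gpt-4o",
  "providerOrder": ["openai", "anthropic", "ollama"]
}
```

With this order, requests fall back to Anthropic and then to a local Ollama model if OpenAI is unavailable.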

Context Window

Controls how the assistant manages conversation length. When the conversation grows beyond the configured token limits, older messages are automatically compacted into summaries while preserving the most recent turns.

maxInputTokens
The hard ceiling on input tokens sent to the model.
compactThreshold
Fraction of maxInputTokens that triggers compaction.
preserveRecentUserTurns
Number of most-recent user messages that are never compacted.
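Putting the three settings together, a context-window block might look like the following (the setting names are from this page; the wrapping `contextWindow` object and the specific values are illustrative):

```json
{
  "contextWindow": {
    "maxInputTokens": 128000,
    "compactThreshold": 0.8,
    "preserveRecentUserTurns": 3
  }
}
```

With these values, compaction kicks in around 0.8 × 128,000 = 102,400 input tokens, and the 3 most recent user messages are always kept verbatim.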

Thinking

Extended thinking gives the assistant a dedicated token budget to reason through complex problems before responding. When enabled, the model uses a separate “thinking” step that is not shown in the final output.
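Since extended thinking is described as a dedicated token budget, its configuration might plausibly look like this (both key names and values are assumptions, not documented settings):

```json
{
  "thinking": {
    "enabled": true,
    "budgetTokens": 8192
  }
}
```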

Memory

The memory system gives the assistant long-term recall across conversations. It automatically extracts facts, preferences, and events from messages and stores them in a searchable index. On each new turn, relevant memories are retrieved and injected into the assistant's context.

Embeddings

Memory items are converted into vector embeddings for semantic search. The embedding provider can be configured to use local models (for privacy) or cloud providers like OpenAI or Google Gemini (for quality). Embeddings are stored in a Qdrant vector database that runs alongside the assistant.
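A hypothetical embeddings block choosing a local model might look like this (the key names are illustrative; `nomic-embed-text` is just an example of a model that can run locally via Ollama):

```json
{
  "embeddings": {
    "provider": "ollama",
    "model": "nomic-embed-text"
  }
}
```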

Retrieval

Retrieval combines lexical (keyword) and semantic (vector) search to find the most relevant memories. Results are optionally re-ranked by a lightweight LLM to ensure precision. The system supports dynamic budget allocation — injecting fewer memories when the conversation is already long, and more when there is headroom.
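One common way to combine a lexical ranking with a semantic ranking is reciprocal rank fusion; the page doesn't say which fusion method the assistant uses, so this is only a minimal sketch of the general technique:

```python
def rrf_merge(lexical_ranked, semantic_ranked, k=60):
    """Merge two ranked lists of memory IDs with reciprocal rank fusion.

    Each item scores 1 / (k + rank + 1) in every list it appears in;
    items ranked highly by either search method rise to the top.
    """
    scores = {}
    for ranking in (lexical_ranked, semantic_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


merged = rrf_merge(["a", "b", "c"], ["b", "a", "d"])
```

A dynamic budget can then simply truncate the merged list to however many memories fit in the remaining context headroom.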

Entity Knowledge Graph

The assistant can extract entities (people, places, organizations) and their relationships from conversations, building a knowledge graph over time. During retrieval, the graph is traversed to surface related context that keyword or semantic search alone might miss.
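The traversal step can be pictured with a toy graph: starting from entities mentioned in the conversation, walk outward a fixed number of hops to collect related entities. This is a generic sketch, not the assistant's actual graph implementation:

```python
# Toy knowledge graph: entity -> list of (relation, neighbor) edges.
graph = {
    "Alice": [("works_at", "Acme"), ("knows", "Bob")],
    "Acme": [("located_in", "Berlin")],
}

def expand(entities, hops=1):
    """Return the input entities plus everything reachable within `hops` edges."""
    seen = set(entities)
    frontier = list(entities)
    for _ in range(hops):
        nxt = []
        for entity in frontier:
            for _, neighbor in graph.get(entity, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    nxt.append(neighbor)
        frontier = nxt
    return seen
```

Here `expand(["Alice"])` surfaces Acme and Bob even if neither appears in the current message, which is exactly the context that keyword or vector search alone would miss.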

Skills

Skills are modular capabilities that extend what the assistant can do. Each skill bundles a set of tools, instructions, and optional configuration into a self-contained unit. See the Skills documentation for a deep dive on how skills work.

Sandbox

When the assistant needs to execute code, it can use a sandboxed environment. The default sandbox backend is Docker, which provides an isolated container with configurable CPU, memory, and network limits, keeping code execution contained so it cannot affect the host system.

Docker (default)
Runs code in an isolated container with configurable CPU, memory, and network access.
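The CPU, memory, and network limits mentioned above might be expressed like this (key names and values are illustrative, not a documented schema):

```json
{
  "sandbox": {
    "backend": "docker",
    "cpus": 1,
    "memoryMb": 512,
    "network": false
  }
}
```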

Swarm

For complex tasks that benefit from parallelism, the assistant can spawn a swarm of worker agents. A planner decomposes the task into subtasks, each handled by an independent worker. Results are synthesized back into a unified response. Swarm execution is configurable with limits on the number of concurrent workers and per-task timeouts.
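The worker and timeout limits mentioned above could be configured along these lines (illustrative key names):

```json
{
  "swarm": {
    "maxWorkers": 4,
    "taskTimeoutSeconds": 300
  }
}
```

With these values, the planner would fan out to at most 4 concurrent workers, and any subtask still running after 5 minutes would be cancelled.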

Safety

Multiple safety layers protect both you and the assistant's runtime:

Secret Detection

Automatically scans messages for API keys, passwords, and other sensitive data. Detected secrets can be redacted or blocked before they are stored or sent to the model.
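Since detected secrets "can be redacted or blocked," a config toggle between the two behaviors is plausible (the key names here are purely illustrative):

```json
{
  "secretDetection": {
    "mode": "redact"
  }
}
```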

Permissions

Controls which tools and actions the assistant is allowed to invoke. The permission system supports workspace-level policies so that high-risk actions require explicit guardian approval.
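A workspace-level policy distinguishing allowed tools from those requiring guardian approval might be sketched like this (all key and tool names are hypothetical examples):

```json
{
  "permissions": {
    "allow": ["read_file", "web_search"],
    "requireApproval": ["execute_code", "send_email"]
  }
}
```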

Rate Limiting

Configurable limits on requests per minute and tokens per session prevent runaway usage and keep costs predictable.
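The two limits named above map naturally onto a config block like the following (key names and values are illustrative):

```json
{
  "rateLimits": {
    "requestsPerMinute": 30,
    "tokensPerSession": 500000
  }
}
```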