Read our latestState of AI development report– explore trends, usage, and emerging patterns!Read the results →
Vellum Product Update | October 2024
Nov 1, 2024
5 min
No items found.
Table of Contents

November is a month for crisp fall weather, giving thanks, and another round of Vellum product updates! In October, we shipped a ton of new models, improvements to Evals, Prompts, Workflows, and more.

Hold the gravy, let’s dive in and see what’s new 🎃

Online Evaluations for Workflow and Prompt Deployments

Previously, you could only run “Offline Evaluations” or “Inline Evaluations.” You can run Offline Evaluations manually when you want to check Prompt / Workflow performance, e.g. when you’re getting ready to make a new Production Release. Inline Evaluations are useful if you want to check quality during a Workflow’s execution and conditionally do something within the Workflow (retry a prompt, throw an error or Slack alert, escalate to a human, etc.)

But what if you want to monitor how your product performs live in production? Now you can!

Online Evaluations help you see your product’s performance in real time. They run on every production execution of your app, helping you catch & resolve edge-cases faster, and prevent regressions more thoroughly. The best part – you can use Vellum premade Metrics, or Custom Metrics that you’ve already configured!

Configure Metrics in the new “Metrics” Tab
Every execution of your Prompt or Workflow now gets evaluated with your Metric

You can read more about Online Evaluations here!

Configurable Prompt Node Timeouts

Previously, if you wanted to avoid having a single Prompt node slow down your workflow, you’d need to setup a few nodes and cumbersome logic to time out early.

Now, you can easily set maximum timeouts for Prompt Nodes within Workflows, preventing bottlenecks and ensuring efficient resource management.

Configuring Timeouts in Prompt Nodes

AutoLayout and AutoConnect for Workflows

As you experiment and your workflows become more complex, keeping them organized will make them easier to iterate on. Now, you can automatically organize and connect nodes in Workflow Sandboxes with just a click.

How to use AutoLayout in Workflows

Datadog and Webhook Logging Beta Integrations

If you want deeper insights into key events happening in Vellum, but in the context of the rest of your systems, now you have it with our Datadog & Webhook Logging integrations (in beta). For example, you can set up a Datadog alert to fire when there are multiple subsequent failures when executing a Workflow Deployment.

If you’d like to participate in the Beta Period and want help setting up their integration, please contact us!

New Models and Providers!

Model optionality gives builders more flexibility to optimize for accuracy, latency, and cost, as use-cases necessitate. Here’s a quick overview of the 25 (!!) new models we added in October:

  • All Perplexity models — including Online models for searching the web!
  • Cerebras — featuring 2,100 tokens/sec. That’s 3x faster than the current state of the art, or nearly 3 books per minute!
  • 13 new OpenRouter models
  • The newest Claude 3.5 Sonnet
  • Gemini 1.5 Flash 8B

Other noteworthy mentions:

  • Vertex AI embedding models: text-embedding-004 and text-multilingual-embedding-002
  • OpenAI Prompt Caching for GPT-4o and GPT-o1 models

Click here to see more details about the new models we’re supporting.

Evaluations

Reorder Test Suite Variables

You can now reorder Input and Evaluation Variables within a Test Suite’s settings page, helping you stay organized & make changes faster by putting related values next to one another.

Test Suite Input & Output Variables Configuration

Reorder Entities in Evaluation Reports

When your Evaluation Reports use many Metrics, often you want to see related Metrics grouped nearby one another. You can now reorder entities in the Evaluation Report table, making it easier to triage your Metric scores and iterate on your Prompts & Workflows accordingly.

Evaluation Report - Metric Types Ordering

Filter and Sort on Metric Scores

You can now filter and sort on a Metric’s score within Evaluation Reports. This makes it easier to find all Test Cases that fall below a given Metric threshold, so you can iterate and improve your products’ robustness faster.

Evaluation Report - Test Case Sorting by Metric Scores

Prompts, Models, and Embeddings

Prompt Caching Support for OpenAI

OpenAI now automatically performs prompt caching to help optimize cost & latency of prompts. In Vellum, we capture the new Cache Tokens when using supported OpenAI models, to help you analyze cache hit rates and optimize LLM spend.

Vertex AI Embedding Model Support

We now support Vertex AI Embedding Models: text-embedding-004 and text-multilingual-embedding-002, giving you more options to optimize your RAG pipelines.

New Models!

That’s right, 25 new models.

Provider / Router Model Description
Cerebras llama3.1-8b Lightning fast at 1,800 tokens/sec
Cerebras llama3.1-70b Lightning fast at 2,100 tokens/sec
Anthropic Claude 3.5 Sonnet 2024-10-22 Across the board improvements but particularly in coding
Anthropic Claude 3.5 Haiku Matches the performance of Claude 3 Opus, Anthropic’s prior largest model, on most evaluations
Perplexity LLama 3.1 Sonar Small 128k Online Designed for efficient online interactions, balancing performance and resource usage
Perplexity LLama 3.1 Sonar Large 128k Online Offers enhanced capabilities for more complex tasks, with improved accuracy and depth
Perplexity LLama 3.1 Sonar Huge 128k Online Excels in handling intricate queries requiring deep reasoning and extensive context
Perplexity Perplexity AI: LLama 3.1 Sonar Small 128k Chat Optimized for chat applications, facilitating fluid and contextually rich conversations
Perplexity Perplexity AI: LLama 3.1 Sonar Large 128k Chat Enhances chat interactions with greater depth and understanding for demanding conversational agents
Perplexity Perplexity AI: LLama 3.1 8B Instruct Instruction-tuned model ideal for tasks requiring precise adherence to instructions
Perplexity Perplexity AI: LLama 3.1 70B Instruct Handles complex tasks with high accuracy and reliability
Google Gemini 1.5 Flash 8B Lightweight, smaller, and faster
OpenRouter Magnum v2 72B Designed to achieve prose quality similar to Claude 3 models
OpenRouter Magnum v4 72B Replicates prose quality of the Claude 3 models, specifically Sonnet and Opus
OpenRouter LLama 3.1 Lumimaid 70B Lumimaid v0.2 70B is a fine-tune with a substantial dataset upgrade compared to Lumimaid v0.1
OpenRouter Nous: Hermes 3 405B Instruct A frontier-level, full-parameter fine-tune of the Llama-3.1 405B foundation model
OpenRouter NousResearch: Hermes 2 Pro - Llama-3 8B An upgraded version of Nous Hermes 2 with improved capabilities
OpenRouter Nous: Hermes 3 405B Instruct (extended) An extended context version of Hermes 3 405B Instruct
OpenRouter Goliath 120B Combines two fine-tuned Llama 70B models
OpenRouter Dolphin 2.9.2 Mixtral 8x22B An uncensored model for instruction following, conversation, and coding
OpenRouter Anthropic: Claude 3.5 Sonnet (self-moderated) A faster, self-moderated endpoint of Claude 3.5 Sonnet
OpenRouter Liquid: LFM 40B MoE A 40.3B Mixture of Experts (MoE) model for general-purpose AI tasks
OpenRouter Eva Qwen 2.5 14B A powerful model based on the Qwen architecture
OpenRouter Rocinante 12B A versatile 12 billion parameter model
OpenRouter WizardLM-2 8x22B Microsoft AI's most advanced Wizard model

Deployments:

New API for Listing Entities in a Folder

Now you can programmatically retrieve all entities in a folder via API. The response lists these entities along with high-level metadata about them.

This new API is available in our SDKs beginning with version 0.8.25. For additional details, check out our API Reference here.

Quality of Life Improvements

Workflow Edge Type Improvements

Edges between Nodes in Workflows could appear jagged or misaligned, making it difficult to visualize connections. With this new improvement, edges now snap into straight-line connectors when they are close to horizontal.

See you in December!

That’s all for now folks. We hope you have a wonderful November, filled with lots of food & fall activities. See ya in December!

PSA - sign up for our newsletter to get these updates in right your inbox!

ABOUT THE AUTHOR
Noa Flaherty
Co-founder & CTO

Noa Flaherty, CTO and co-founder at Vellum (YC W23) is helping developers to develop, deploy and evaluate LLM-powered apps. His diverse background in mechanical and software engineering, as well as marketing and business operations gives him the technical know-how and business acumen needed to bring value to nearly any aspect of startup life. Prior to founding Vellum, Noa completed his undergrad at MIT and worked at three tech startups, including roles in MLOps at DataRobot and Product Engineering at Dover.

No items found.
The Best AI Tips — Direct To Your Inbox

Latest AI news, tips, and techniques

Specific tips for Your AI use cases

No spam

Oops! Something went wrong while submitting the form.

Each issue is packed with valuable resources, tools, and insights that help us stay ahead in AI development. We've discovered strategies and frameworks that boosted our efficiency by 30%, making it a must-read for anyone in the field.

Marina Trajkovska
Head of Engineering

This is just a great newsletter. The content is so helpful, even when I’m busy I read them.

Jeremy Hicks
Solutions Architect
Related Posts
View More

Experiment, Evaluate, Deploy, Repeat.

AI development doesn’t end once you've defined your system. Learn how Vellum helps you manage the entire AI development lifecycle.