Vellum is coming to the AI Engineering World's Fair in SF. Come visit our booth and get a live demo!

How we cut model costs by >90% by swapping LoRA weights dynamically

Dynamically swapping LoRA weights can significantly lower costs of a fine tuned model

Reviewed by
No items found.

tl;dr: We’ve been working on fine tuning of open source models and wanted to share a technique which helped significantly reduce costs of serving specific fine tuned models. This is only possible if you have enough usage to keep a GPU fully utilized.

Our views on fine tuning

A few weeks ago we wrote a blog on why fine-tuning is making a comeback in the world of LLMs. As a recap, fine-tuning involves training a pre-existing model on a smaller, task-specific dataset to adapt it to a particular task or domain. The foundation model, a pre-trained LLM like Llama-2-7b, serves as the initial starting point. All weights of this network are then further optimized based on the data specific to the task at hand. The result is a model that uses its pre-trained proficiency in general language to become an expert at the specific task. The fine-tuned model has better performance on specific tasks, lower cost & latency, and improved privacy.

We’ve started working on fine tuning these models for our customers and wanted to share some early learnings.

What is LoRA in the context of fine tuning?

LoRA (Low Rank Adaption of LLMs) is a technique where you only need to add a small number of extra parameters (< 1%) and fine tune those. In the case of this Llama-2-7b model, fewer than 70m parameters would need to be trained. This makes the training process much faster and cheaper, and the model does surprisingly well on most tasks. The end result is we now have a small adapter that can be added to the base model to achieve high performance on the target task. Swapping only the LoRA weights instead of all parameters allows cheaper switching between tasks. Multiple customized models can be created on one GPU and swapped in and out easily.

How LoRA can help reduce costs if you have multiple tasks

Let’s use an analogy inspired by a garden hose. Do you recall seeing a garden hose which can take various adapters like a regular stream, a jet, a cone, a mist, a shower, etc.? The same garden hose can spray water in different ways depending on the adapter you choose. If the total demand for water can be fulfilled by the water going to one hose you can serve varying use cases with these adapters (there’s no need for 12 different hoses for 12 use cases).

Dynamically swapping LoRA weights for fine tuned tasks works in a similar way. The foundation model is served on one GPU which is always running and can swap between different LoRA weights as needed. As long as enough models are served that the GPU will always be warm and utilized, cost can be split across all the use cases.

However, for most companies, it’s difficult to fully occupy a GPU’s capacity. Costs add up if your GPU is sitting idle. If you only selectively use the GPU then you have to overcome cold start problem, adding latency of up to 5 minutes (‼️)

This is where an aggregate like Vellum comes in. We serve enough use cases across customers to always keep GPUs occupied. In the low usage limit (i.e., when an individual fine tuned model is not used too much), cost per request goes down by ~99% and additional latency is only 50ms.

Next steps

If you’re interested in exploring a lower cost alternative to your current fine tuned model or prompt based model please reach out to me at akash@vellum.ai. At Vellum we abstract away the complexities with training models and make them extremely easy to use.

ABOUT THE AUTHOR
Sidd Seethepalli
Co-founder and CTO

Sidd Seethepalli, CTO and co-founder at Vellum (YC W23) is very passionate about LLM product development, and is constantly pushing the boundaries of what’s possible with current models and techniques for more than 100 customers at Vellum who use LLMs in production. Before starting Vellum, Sidd completed his undergrad at the Massachusetts Institute of Technology, then spent 4 years working for well known tech companies like Quora and Dover.

ABOUT THE reviewer

No items found.
lAST UPDATED
Aug 3, 2023
share post
Expert verified
Related Posts
LLM basics
January 28, 2026
20 min
2026 Marketer's Guide to AI Agents for Marketing Operations
LLM basics
January 26, 2026
18 min
Top 20 AI Agent Builder Platforms (Complete 2026 Guide)
Product Updates
January 13, 2026
5 min
Introducing Vellum for Agents
January 10, 2026
8 min
Vellum Product Update | December
All
December 12, 2025
7 min
How we use coding agents to 2x engineering output
LLM basics
December 12, 2025
8 min
GPT-5.2 Benchmarks
The Best AI Tips — Direct To Your Inbox

Latest AI news, tips, and techniques

Specific tips for Your AI use cases

No spam

Oops! Something went wrong while submitting the form.

Each issue is packed with valuable resources, tools, and insights that help us stay ahead in AI development. We've discovered strategies and frameworks that boosted our efficiency by 30%, making it a must-read for anyone in the field.

Marina Trajkovska
Head of Engineering

This is just a great newsletter. The content is so helpful, even when I’m busy I read them.

Jeremy Hicks
Solutions Architect

Experiment, Evaluate, Deploy, Repeat.

AI development doesn’t end once you've defined your system. Learn how Vellum helps you manage the entire AI development lifecycle.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Build AI agents in minutes with Vellum
Build agents that take on the busywork and free up hundreds of hours. No coding needed, just start creating.

General CTA component, Use {{general-cta}}

Build AI agents in minutes with Vellum
Build agents that take on the busywork and free up hundreds of hours. No coding needed, just start creating.

General CTA component  [For enterprise], Use {{general-cta-enterprise}}

The best AI agent platform for enterprises
Production-grade rigor in one platform: prompt builder, agent sandbox, and built-in evals and monitoring so your whole org can go AI native.

[Dynamic] Ebook CTA component using the Ebook CMS filtered by name of ebook.
Use {{ebook-cta}} and add a Ebook reference in the article

Thank you!
Your submission has been received!
Oops! Something went wrong while submitting the form.
Button Text

LLM leaderboard CTA component. Use {{llm-cta}}

Check our LLM leaderboard
Compare all open-source and proprietary model across different tasks like coding, math, reasoning and others.

Case study CTA component (ROI) = {{roi-cta}}

40% cost reduction on AI investment
Learn how Drata’s team uses Vellum and moves fast with AI initiatives, without sacrificing accuracy and security.

Case study CTA component (cutting eng overhead) = {{coursemojo-cta}}

6+ months on engineering time saved
Learn how CourseMojo uses Vellum to enable their domain experts to collaborate on AI initiatives, reaching 10x of business growth without expanding the engineering team.

Case study CTA component (Time to value) = {{time-cta}}

100x faster time to deployment for AI agents
See how RelyHealth uses Vellum to deliver hundreds of custom healthcare agents with the speed customers expect and the reliability healthcare demands.

[Dynamic] Guide CTA component using Blog Post CMS, filtering on Guides’ names

100x faster time to deployment for AI agents
See how RelyHealth uses Vellum to deliver hundreds of custom healthcare agents with the speed customers expect and the reliability healthcare demands.
New CTA
Sorts the trigger and email categories

Dynamic template box for healthcare, Use {{healthcare}}

Start with some of these healthcare examples

Clinical trial matchmaker
Match patients to relevant clinical trials based on EHR.
Claims compliance review agent
Examines claim submissions for compliance and recommends corrections

Dynamic template box for insurance, Use {{insurance}}

Start with some of these insurance examples

AI agent for claims review
Review healthcare claims, detect anomalies and benchmark pricing.
Agent that summarizes lengthy reports (PDF -> Summary)
Summarize all kinds of PDFs into easily digestible summaries.
Insurance claims automation agent
Collect and analyze claim information, assess risk and verify policy details.

Dynamic template box for eCommerce, Use {{ecommerce}}

Start with some of these eCommerce examples

E-commerce shopping agent
Check order status, manage shopping carts and process returns.

Dynamic template box for Marketing, Use {{marketing}}

Start with some of these marketing examples

Creative content generator agent
Give it a URL and a format, and it turns the source into finished creative content.
Competitor research agent
Scrape relevant case studies from competitors and extract ICP details.

Dynamic template box for Sales, Use {{sales}}

Start with some of these sales examples

Objection capture agent for sales calls
Take call transcripts, extract objections, and update the associated Hubspot contact record.
Closed-lost deal review agent
Review all deals marked as "Closed lost" in Hubspot and send summary to the team.

Dynamic template box for Legal, Use {{legal}}

Start with some of these legal examples

Compliance review agent
Checks DPAs and privacy policies against your compliance checklist then scores coverage and make a plan.
Legal RAG chatbot
Chatbot that provides answers based on user queries and legal documents.

Dynamic template box for Supply Chain/Logistics, Use {{supply}}

Start with some of these supply chain examples

Risk assessment agent for supply chain operations
Comprehensive risk assessment for suppliers based on various data inputs.

Dynamic template box for Edtech, Use {{edtech}}

Start with some of these edtech examples

No items found.

Dynamic template box for Compliance, Use {{compliance}}

Start with some of these compliance examples

No items found.

Dynamic template box for Customer Support, Use {{customer}}

Start with some of these customer support examples

Customer support agent
Support chatbot that classifies user messages and escalates to a human when needed.
Ticket Escalation Bot
Detect escalated support tickets and assigns them in Linear.

Template box, 2 random templates, Use {{templates}}

Start with some of these agents

Retail pricing optimizer agent
Analyze product data and market conditions and recommend pricing strategies.
Customer support agent
Support chatbot that classifies user messages and escalates to a human when needed.

Template box, 6 random templates, Use {{templates-plus}}

Build AI agents in minutes

Review Comment Generator for GitHub PRs
Use predefined guidelines to write a code review comment for a GitHub PR.
Competitor research agent
Scrape relevant case studies from competitors and extract ICP details.
Financial Statement Review Workflow
Extract and review financial statements and their corresponding footnotes from SEC 10-K filings.
NDA deviation review agent
Reviews NDAs against your standard template, highlights differences, and sends a risk rated summary to Slack.
Legal RAG chatbot
Chatbot that provides answers based on user queries and legal documents.
Legal contract review AI agent
Asses legal contracts and check for required classes, asses risk and generate report.

Build AI agents in minutes for

{{industry_name}}

Roadmap planner
Agent that reviews your roadmap and suggests changes based on team capacity.
Account monitoring agent
Combines product usage data with CRM data from HubSpot or Salesforce to flag accounts with declining usage, especially ahead of renewals.
Cross team status updates
Scans Linear for stale, blocked, or repeatedly reopened issues, flags patterns, and uses Devin to propose cleanup or refactor suggestions.
SEO article generator
Generates SEO optimized articles by researching top results, extracting themes, and writing content ready to publish.
Stripe transaction review agent
Analyzes recent Stripe transactions for suspicious patterns, flags potential fraud, posts a summary in Slack.
KYC compliance agent
Automates KYC checks by reviewing customer documents stored in HubSpot

Case study results overview (usually added at top of case study)

What we did:

1-click

This is some text inside of a div block.

28,000+

Separate vector databases managed per tenant.

100+

Real-world eval tests run before every release.