How we cut model costs by >90% by swapping LoRA weights dynamically

Dynamically swapping LoRA weights can significantly lower the cost of serving a fine-tuned model

tl;dr: We’ve been fine-tuning open source models and want to share a technique that significantly reduced the cost of serving specific fine-tuned models. It only works if there is enough combined usage to keep a GPU fully utilized.

Our views on fine tuning

A few weeks ago we wrote a blog post on why fine-tuning is making a comeback in the world of LLMs. As a recap, fine-tuning involves training a pre-existing model on a smaller, task-specific dataset to adapt it to a particular task or domain. The foundation model, a pre-trained LLM like Llama-2-7b, serves as the starting point. In full fine-tuning, all weights of this network are then further optimized on the data specific to the task at hand. The result is a model that builds on its pre-trained proficiency in general language to become an expert at the specific task. Compared with prompting a larger general-purpose model, the fine-tuned model can offer better performance on the specific task, lower cost and latency, and improved privacy.

We’ve started working on fine tuning these models for our customers and wanted to share some early learnings.

What is LoRA in the context of fine tuning?

LoRA (Low-Rank Adaptation of LLMs) is a technique where you freeze the base model, add a small number of extra parameters (< 1% of the total), and fine-tune only those. In the case of Llama-2-7b, fewer than 70M parameters would need to be trained. This makes the training process much faster and cheaper, and the model does surprisingly well on most tasks. The end result is a small adapter that can be added to the base model to achieve high performance on the target task. Swapping only the LoRA weights instead of all parameters makes switching between tasks cheaper. Multiple customized models can share one GPU and be swapped in and out easily.
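To make the "< 1%" figure concrete, here is a rough back-of-the-envelope count. It is a sketch under assumptions that are not from this post: a LoRA rank of 8, adapters applied only to the four attention projection matrices in each layer, and a round 7 billion parameters for the base model. The hidden size (4096) and layer count (32) are those of Llama-2-7b. For a weight matrix of shape d_out x d_in, LoRA trains two small matrices of shapes d_out x r and r x d_in instead of the full matrix.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns B (d_out x rank) and A (rank x d_in); the base matrix stays frozen.
    """
    return rank * (d_in + d_out)


HIDDEN_SIZE = 4096            # Llama-2-7b hidden dimension
NUM_LAYERS = 32               # Llama-2-7b transformer layers
MATRICES_PER_LAYER = 4        # assumption: adapt the q, k, v, o projections only
RANK = 8                      # assumption: a commonly used LoRA rank
BASE_PARAMS = 7_000_000_000   # approximate size of the base model

adapter_params = (
    NUM_LAYERS
    * MATRICES_PER_LAYER
    * lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
)
fraction = adapter_params / BASE_PARAMS

print(f"adapter parameters: {adapter_params:,}")
print(f"share of base model: {fraction:.2%}")
```

With these assumed settings the adapter has about 8.4M trainable parameters, well under the 70M ceiling mentioned above. A higher rank or more adapted matrices would increase that count proportionally.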

How LoRA can help reduce costs if you have multiple tasks

Let’s use an analogy inspired by a garden hose. Do you recall seeing a garden hose which can take various adapters like a regular stream, a jet, a cone, a mist, a shower, etc.? The same garden hose can spray water in different ways depending on the adapter you choose. If the total demand for water can be fulfilled by the water going to one hose you can serve varying use cases with these adapters (there’s no need for 12 different hoses for 12 use cases).

Dynamically swapping LoRA weights for fine-tuned tasks works in a similar way. The foundation model is served on one GPU that is always running and can swap between different LoRA weights as needed. As long as enough models are served to keep the GPU warm and utilized, its cost can be split across all the use cases.
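The idea can be illustrated with a toy, single-layer sketch in plain Python. This is not Vellum's serving code: the class names, the task names, and the tiny 2x2 matrices are all hypothetical, chosen only to show one frozen base weight being shared while a small per-task adapter is selected at request time. The output follows the standard LoRA form, base output plus scale times B(Ax).

```python
from dataclasses import dataclass

Matrix = list[list[float]]
Vector = list[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


@dataclass(frozen=True)
class LoraAdapter:
    """Hypothetical container for one task's low-rank weights."""
    a: Matrix           # shape: rank x d_in
    b: Matrix           # shape: d_out x rank
    scale: float = 1.0  # scaling applied to the low-rank update


class SharedBaseLayer:
    """One frozen base weight shared by every task; adapters are looked up per request."""

    def __init__(self, base_weight: Matrix) -> None:
        self.base_weight = base_weight
        self.adapters: dict[str, LoraAdapter] = {}

    def register(self, task: str, adapter: LoraAdapter) -> None:
        self.adapters[task] = adapter

    def forward(self, task: str, x: Vector) -> Vector:
        adapter = self.adapters[task]            # "swap": pick this task's adapter
        base_out = matvec(self.base_weight, x)   # shared computation
        delta = matvec(adapter.b, matvec(adapter.a, x))  # low-rank correction
        return [y + adapter.scale * d for y, d in zip(base_out, delta)]


# Toy usage: identity base weight, two rank-1 adapters for two made-up tasks.
layer = SharedBaseLayer([[1.0, 0.0], [0.0, 1.0]])
layer.register("summarize", LoraAdapter(a=[[1.0, 0.0]], b=[[1.0], [0.0]]))
layer.register("classify", LoraAdapter(a=[[0.0, 1.0]], b=[[0.0], [1.0]], scale=0.5))

print(layer.forward("summarize", [1.0, 2.0]))  # [2.0, 2.0]
print(layer.forward("classify", [1.0, 2.0]))   # [1.0, 3.0]
```

The base weight is loaded once and never modified. Only the small adapter matrices differ between tasks, which is why switching between them is cheap compared with loading a separate full model per task.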

However, for most companies, it’s difficult to fully occupy a GPU’s capacity, and costs add up while the GPU sits idle. If you instead spin the GPU up only when needed, you have to overcome the cold start problem, which can add up to 5 minutes of latency (‼️).

This is where an aggregator like Vellum comes in. We serve enough use cases across customers to always keep GPUs occupied. In the low usage limit (i.e., when an individual fine-tuned model receives little traffic), cost per request goes down by ~99% and the additional latency is only 50ms.
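The arithmetic behind this kind of saving can be sketched as follows. All numbers here are hypothetical and chosen for illustration only; they are not Vellum's actual prices or traffic. The sketch compares a GPU dedicated to one low-traffic model with a fully utilized GPU whose hourly cost is spread over every request it serves.

```python
def cost_per_request(gpu_cost_per_hour: float, requests_per_hour: float) -> float:
    """Hourly GPU cost divided evenly over the requests served in that hour."""
    return gpu_cost_per_hour / requests_per_hour


# Hypothetical inputs, for illustration only.
GPU_COST_PER_HOUR = 2.00                  # dollars per hour for one GPU
MODEL_REQUESTS_PER_HOUR = 100             # traffic of one low-usage fine-tuned model
GPU_CAPACITY_REQUESTS_PER_HOUR = 10_000   # requests per hour when the GPU is fully utilized

# Dedicated GPU: one model's traffic pays for the whole GPU.
dedicated = cost_per_request(GPU_COST_PER_HOUR, MODEL_REQUESTS_PER_HOUR)

# Shared GPU: the same hourly cost is split across all adapters' requests.
shared = cost_per_request(GPU_COST_PER_HOUR, GPU_CAPACITY_REQUESTS_PER_HOUR)

reduction = 1 - shared / dedicated

print(f"dedicated: ${dedicated:.4f} per request")
print(f"shared:    ${shared:.4f} per request")
print(f"reduction: {reduction:.0%}")
```

In this simplified model the saving depends only on the ratio between one model's traffic and the GPU's full capacity, so the lower an individual model's usage, the larger the relative reduction from sharing.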

Next steps

If you’re interested in exploring a lower cost alternative to your current fine-tuned or prompt-based model, please reach out to me at akash@vellum.ai. At Vellum we abstract away the complexities of training models and make them extremely easy to use.

ABOUT THE AUTHOR
Sidd Seethepalli
Co-founder and CTO

Sidd Seethepalli, CTO and co-founder at Vellum (YC W23), is passionate about LLM product development and is constantly pushing the boundaries of what’s possible with current models and techniques for the more than 100 Vellum customers who use LLMs in production. Before starting Vellum, Sidd completed his undergrad at the Massachusetts Institute of Technology, then spent 4 years working for well-known tech companies like Quora and Dover.

LAST UPDATED
Aug 3, 2023