
The inference trap: How cloud providers are eating your AI margins

By Advanced AI Editor | June 28, 2025 | 8 Mins Read

This article is part of VentureBeat’s special issue, “The Real Cost of AI: Performance, Efficiency and ROI at Scale.” Read more from this special issue.

AI has become the holy grail of modern companies. Whether it’s customer service or something as niche as pipeline maintenance, organizations in every domain are now implementing AI technologies — from foundation models to vision-language-action models (VLAs) — to make things more efficient. The goal is straightforward: automate tasks so outcomes are delivered faster while saving money and resources at the same time.

However, as these projects transition from pilot to production, teams encounter a hurdle they hadn’t planned for: cloud costs eroding their margins. The sticker shock is so severe that what once felt like the fastest path to innovation and competitive edge quickly becomes an unsustainable budgetary black hole.

This prompts CIOs to rethink everything—from model architecture to deployment models—to regain control over financial and operational aspects. Sometimes, they even shutter the projects entirely, starting over from scratch.

But here’s the truth: while the cloud can push costs to unbearable levels, it is not the villain. You just have to understand what type of vehicle (AI infrastructure) to choose for which road (the workload).

The cloud story — and where it works 

The cloud is very much like public transport (your subways and buses). You get on board with a simple rental model, and it instantly gives you all the resources—right from GPU instances to fast scaling across various geographies—to take you to your destination, all with minimal work and setup. 

The fast and easy access via a service model ensures a seamless start, paving the way to get the project off the ground and do rapid experimentation without the huge up-front capital expenditure of acquiring specialized GPUs. 

Most early-stage startups find this model attractive, as they need fast turnaround more than anything else, especially while they are still validating the model and determining product-market fit.

“You make an account, click a few buttons, and get access to servers. If you need a different GPU size, you shut down and restart the instance with the new specs, which takes minutes. If you want to run two experiments at once, you initialise two separate instances. In the early stages, the focus is on validating ideas quickly. Using the built-in scaling and experimentation frameworks provided by most cloud platforms helps reduce the time between milestones,” Rohan Sarin, who leads voice AI product at Speechmatics, told VentureBeat.
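
For illustration, here is a minimal sketch of that same workflow done programmatically rather than through the console, assuming an AWS account and the boto3 SDK; the AMI ID, instance type, and tags below are placeholders, not recommendations.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single GPU instance for one experiment.
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: a deep learning AMI in your account
    InstanceType="g5.xlarge",          # need different specs? terminate and relaunch with a new type
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [
            {"Key": "team", "Value": "research"},       # assumed tag scheme
            {"Key": "purpose", "Value": "experiment-42"},
        ],
    }],
)
instance_id = resp["Instances"][0]["InstanceId"]

# ... run the experiment against the instance ...

# Shut it down as soon as the run finishes to stop the meter.
ec2.terminate_instances(InstanceIds=[instance_id])
```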

The cost of “ease”

While the cloud makes perfect sense for early-stage usage, the infrastructure math becomes grim as the project transitions from testing and validation to real-world volumes. The scale of the workloads makes the bills brutal — so much so that costs can surge by over 1,000% overnight.

This is particularly true in the case of inference, which not only has to run 24/7 to ensure service uptime but also scale with customer demand. 

On most occasions, Sarin explains, inference demand spikes when other customers are also requesting GPU access, increasing the competition for resources. In such cases, teams either keep reserved capacity to make sure they get what they need — leading to idle GPU time during off-peak hours — or suffer latency spikes that degrade the downstream user experience.

Christian Khoury, the CEO of AI compliance platform EasyAudit AI, described inference as the new “cloud tax,” telling VentureBeat that he has seen companies go from $5K to $50K/month overnight, just from inference traffic.

It’s also worth noting that inference workloads involving LLMs, with their token-based pricing, can trigger the steepest cost increases. These models are non-deterministic and can generate outputs of very different lengths when handling long-running tasks (involving large context windows). Combined with continuous model updates, this makes LLM inference costs very difficult to forecast or control.
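
As a back-of-the-envelope illustration of why that forecasting is hard (the prices and volumes below are made-up assumptions, not quoted provider rates), output-length variance alone can swing the monthly bill by several multiples:

```python
# Rough monthly inference-cost estimate under token-based pricing.
# All numbers are illustrative assumptions, not real provider rates.
requests_per_day = 200_000
input_tokens_per_request = 1_500          # prompt plus retrieved context
price_per_1k_input = 0.0025               # $ per 1K input tokens (assumed)
price_per_1k_output = 0.01                # $ per 1K output tokens (assumed)

def monthly_cost(avg_output_tokens: int) -> float:
    per_request = (input_tokens_per_request / 1000) * price_per_1k_input \
                + (avg_output_tokens / 1000) * price_per_1k_output
    return per_request * requests_per_day * 30

# The same traffic, with only the average output length changing:
print(f"short answers (200 tokens):   ${monthly_cost(200):,.0f}/month")
print(f"long answers (2,000 tokens):  ${monthly_cost(2000):,.0f}/month")
```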

Training these models, for its part, tends to be “bursty” (occurring in clusters), which does leave some room for capacity planning. However, even in these cases, especially as growing competition forces frequent retraining, enterprises can rack up massive bills from idle GPU time stemming from overprovisioning.

“Training credits on cloud platforms are expensive, and frequent retraining during fast iteration cycles can escalate costs quickly. Long training runs require access to large machines, and most cloud providers only guarantee that access if you reserve capacity for a year or more. If your training run only lasts a few weeks, you still pay for the rest of the year,” Sarin explained.
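
A quick, illustrative calculation (with assumed rates and utilization, not provider quotes) shows how much of a year-long reservation can end up paying for idle capacity:

```python
# Illustrative cost of reserving GPU capacity for a year
# when the actual training runs are short and bursty.
gpus_reserved = 64
hourly_rate_per_gpu = 2.50          # assumed reserved rate, $/GPU-hour
hours_per_year = 365 * 24

weeks_of_actual_training = 8        # a few retraining cycles per year (assumed)
used_hours = weeks_of_actual_training * 7 * 24

total_cost = gpus_reserved * hourly_rate_per_gpu * hours_per_year
used_cost = gpus_reserved * hourly_rate_per_gpu * used_hours

print(f"annual reservation cost:              ${total_cost:,.0f}")
print(f"cost attributable to actual training: ${used_cost:,.0f}")
print(f"paid-for idle capacity:               ${total_cost - used_cost:,.0f} "
      f"({100 * (1 - used_hours / hours_per_year):.0f}% of the spend)")
```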

And it’s not just this. Cloud lock-in is very real. If you have made a long-term reservation and bought credits from a provider, you’re locked into their ecosystem and have to use whatever they have on offer, even when other providers have moved to newer, better infrastructure. And when you finally do get the ability to move, you may have to bear massive egress fees.

“It’s not just compute cost. You get…unpredictable autoscaling, and insane egress fees if you’re moving data between regions or vendors. One team was paying more to move data than to train their models,” Sarin emphasized.

So, what’s the workaround?

Given the constant infrastructure demand of scaling AI inference and the bursty nature of training, enterprises are increasingly splitting the workloads — moving inference to colocation or on-prem stacks while leaving training in the cloud on spot instances.

This isn’t just theory — it’s a growing movement among engineering leaders trying to put AI into production without burning through runway.

“We’ve helped teams shift to colocation for inference using dedicated GPU servers that they control. It’s not sexy, but it cuts monthly infra spend by 60–80%,” Khoury added. “Hybrid’s not just cheaper—it’s smarter.”

In one case, he said, a SaaS company reduced its monthly AI infrastructure bill from approximately $42,000 to just $9,000 by moving inference workloads off the cloud. The switch paid for itself in under two weeks.

Another team requiring consistent sub-50ms responses for an AI customer support tool discovered that cloud-based inference couldn’t meet the latency target. Shifting inference closer to users via colocation not only solved the performance bottleneck — it also halved the cost.

The setup typically works like this: inference, which is always-on and latency-sensitive, runs on dedicated GPUs either on-prem or in a nearby data center (colocation facility). Meanwhile, training, which is compute-intensive but sporadic, stays in the cloud, where you can spin up powerful clusters on demand, run for a few hours or days, and shut down. 

Broadly, it is estimated that renting from hyperscale cloud providers can cost three to four times more per GPU hour than working with smaller providers, with the difference being even more significant compared to on-prem infrastructure.

The other big bonus? Predictability. 

With on-prem or colocation stacks, teams also have full control over the resources they provision or add for the expected baseline of inference workloads. This brings predictability to infrastructure costs — and eliminates surprise bills. It also removes the aggressive engineering effort otherwise needed to tune autoscaling and keep cloud infrastructure costs within reason.
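
As a rough illustration of that baseline planning (the traffic figures and per-GPU concurrency below are assumptions), sizing a fixed inference fleet can be sketched with Little's law:

```python
import math

# Rough sizing for a fixed inference baseline; all figures are illustrative assumptions.
peak_requests_per_second = 120
avg_seconds_per_request = 0.25        # time a request occupies a model replica
requests_in_parallel_per_gpu = 2      # assumed batching/concurrency per GPU
headroom = 1.3                        # 30% safety margin over the expected peak

# Little's law: concurrent requests = arrival rate x time in system
concurrent_requests = peak_requests_per_second * avg_seconds_per_request
gpus_to_provision = math.ceil(concurrent_requests / requests_in_parallel_per_gpu * headroom)

print(f"concurrent requests at peak:     {concurrent_requests:.0f}")
print(f"GPUs to provision for baseline:  {gpus_to_provision}")
```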

Hybrid setups also help reduce latency for time-sensitive AI applications and enable better compliance, particularly for teams operating in highly regulated industries like finance, healthcare, and education — where data residency and governance are non-negotiable.

Hybrid complexity is real—but rarely a dealbreaker

As has always been the case, the shift to a hybrid setup comes with its own ops tax. Setting up your own hardware or renting a colocation facility takes time, and managing GPUs outside the cloud requires a different kind of engineering muscle.

However, leaders argue that the complexity is often overstated and is usually manageable in-house or through external support, unless one is operating at an extreme scale.

“Our calculations show that an on-prem GPU server costs about the same as six to nine months of renting the equivalent instance from AWS, Azure, or Google Cloud, even with a one-year reserved rate. Since the hardware typically lasts at least three years, and often more than five, this becomes cost-positive within the first nine months. Some hardware vendors also offer operational pricing models for capital infrastructure, so you can avoid upfront payment if cash flow is a concern,” Sarin explained.
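
To make that math concrete, here is a small break-even sketch; the figures are placeholders chosen to be consistent with the six-to-nine-month ratio Sarin describes, not actual vendor pricing, and it deliberately ignores power, cooling, and ops staffing:

```python
# Break-even sketch: buying a GPU server vs. renting the equivalent cloud instance.
# All figures are illustrative placeholders, not vendor quotes.
server_purchase_price = 60_000      # one-time cost of an on-prem GPU server (assumed)
equivalent_cloud_monthly = 8_000    # reserved-rate rent for a comparable instance (assumed)
hardware_lifetime_years = 4         # typically at least three, often more than five

break_even_months = server_purchase_price / equivalent_cloud_monthly
lifetime_cloud_cost = equivalent_cloud_monthly * 12 * hardware_lifetime_years

print(f"break-even after ~{break_even_months:.1f} months")
print(f"cloud rent over the hardware's lifetime: ${lifetime_cloud_cost:,.0f}")
print(f"savings vs. buying once:                 ${lifetime_cloud_cost - server_purchase_price:,.0f}")
```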

Prioritize by need

For any company, whether a startup or an enterprise, the key to success when architecting – or re-architecting – AI infrastructure lies in working according to the specific workloads at hand. 

If you’re unsure about the load of different AI workloads, start in the cloud and keep a close eye on the associated costs by tagging every resource with the team responsible for it. You can then share these cost reports with managers and dig into what each team is using and how it affects the bill. This data brings clarity and paves the way for driving efficiencies.
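
On AWS, for example, that tag-and-report loop might look roughly like the sketch below (a minimal illustration using boto3 and Cost Explorer; the tag key, instance ID, and date range are assumptions, and cost allocation tags must be activated in the billing console before they show up in reports):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ce = boto3.client("ce", region_name="us-east-1")   # Cost Explorer

# 1) Tag the GPU instances each team owns (instance ID is a placeholder).
ec2.create_tags(
    Resources=["i-0abc123def4567890"],
    Tags=[{"Key": "team", "Value": "recommendations"}],
)

# 2) Pull last month's spend grouped by the "team" tag.
report = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-05-01", "End": "2025-06-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for group in report["ResultsByTime"][0]["Groups"]:
    team = group["Keys"][0]   # formatted as "team$<value>"
    cost = group["Metrics"]["UnblendedCost"]["Amount"]
    print(f"{team}: ${float(cost):,.2f}")
```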

That said, remember that it’s not about ditching the cloud entirely; it’s about optimizing its use to maximize efficiencies. 

“Cloud is still great for experimentation and bursty training. But if inference is your core workload, get off the rent treadmill. Hybrid isn’t just cheaper… It’s smarter,” Khoury added. “Treat cloud like a prototype, not the permanent home. Run the math. Talk to your engineers. The cloud will never tell you when it’s the wrong tool. But your AWS bill will.”


