
How to Reduce OpenAI Costs Without Breaking Your Product

Microsoft Engineering just cancelled Claude Code access for 100,000 engineers. Uber burned through their entire 2026 AI budget in four months. These aren’t reckless companies. They’re you, scaled up.

If you want to reduce OpenAI costs without breaking what your product does, you need to understand where the spending actually goes. Most LLM bills break down into three categories of waste, and each has a fix that doesn’t degrade output quality.

The OpenAI bill that won’t stop growing

Every founder I have talked to who runs an AI SaaS tells the same story. Month one was $400. Month three was $1,800. Month six was $11,000.

The product didn’t grow 30x. The bill did.

The team adds a “make it smarter” feature, the default model is gpt-4o, and some user finds a way to call it 200 times a day. Nobody notices until the invoice arrives. Then panic mode kicks in: teams rip out features, throttle users, and turn down the model temperature. All of these damage the product, and none of them is necessary.

The three real causes of a high OpenAI bill

Almost every bloated bill comes from one or more of these three causes:

  1. You are using gpt-4o for tasks gpt-4o-mini does just as well. The premium model costs roughly 16x more per output token ($10 versus $0.60 per million).
  2. You have no per-user spending limits, so one power user can burn what 500 normal users do in a month.
  3. You have no visibility into which feature is the cost driver. You assume it’s the chat, when it is actually the silent summarization that runs in the background on every doc upload.

Fixing any one of these usually drops the bill by 30 to 70 percent. Fixing all three is the difference between a viable AI product and one that bleeds out.
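Cause 3 is the easiest to make concrete. The sketch below assumes you already log one record per LLM call and tag it with the feature that made it. The `UsageRecord` shape and the `callCost` and `costByFeature` names are hypothetical, and the prices are the list prices quoted in this article, so check them against current pricing before relying on the totals.

```typescript
// Hypothetical usage record: one row per LLM call, tagged with the calling feature.
interface UsageRecord {
  feature: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// USD per 1M tokens, as quoted in this article (assumption: verify against current pricing).
const PRICES: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
};

// Cost of a single call in USD.
function callCost(r: UsageRecord): number {
  const p = PRICES[r.model];
  if (!p) throw new Error(`No price configured for model ${r.model}`);
  return (r.inputTokens * p.input + r.outputTokens * p.output) / 1_000_000;
}

// Sum cost per feature and return the features sorted from most to least expensive.
function costByFeature(records: UsageRecord[]): Array<[string, number]> {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r.feature, (totals.get(r.feature) ?? 0) + callCost(r));
  }
  return Array.from(totals.entries()).sort((a, b) => b[1] - a[1]);
}
```

Run this over a month of logs and the first entry is the feature to look at first, which is often not the one you expected.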

How to pick the right model for each task

Use gpt-4o-mini by default. Move to gpt-4o only when you can name the specific reasoning capability that requires it.

For these tasks, gpt-4o-mini is fine:

  • Summarizing short documents
  • Classifying intent (“is this a support request or a sales lead?”)
  • Extracting structured fields from text
  • Rewriting tone or style
  • Generating short text snippets

For these, you probably need gpt-4o or claude-sonnet-4:

  • Multi-step reasoning (“plan a 7-day trip to Tokyo with these constraints”)
  • Code generation that has to actually compile
  • Long-context document analysis (over 10k tokens)
  • Anything where a wrong answer materially hurts the user

The honest test: build the cheap version first, ship it to 10 users, and only upgrade if quality is unacceptable. Almost always, the cheap version works fine.
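One way to encode the two lists above is a small routing function that defaults to the cheap model and upgrades only for named reasons. This is a sketch under assumptions: the `Task` categories, the `pickModel` name, and the options object are illustrative, and the 10,000-token threshold is the one mentioned in the list.

```typescript
// Task categories taken from the two lists above; names are illustrative.
type Task =
  | "summarize_short"
  | "classify_intent"
  | "extract_fields"
  | "rewrite_tone"
  | "short_snippet"
  | "multi_step_reasoning"
  | "code_generation";

// Tasks that are assumed to need the premium model.
const PREMIUM_TASKS: ReadonlySet<Task> = new Set<Task>([
  "multi_step_reasoning",
  "code_generation",
]);

interface RoutingOptions {
  inputTokens?: number;  // estimated prompt size
  highStakes?: boolean;  // a wrong answer would materially hurt the user
}

// Default to gpt-4o-mini; upgrade only when a specific reason applies.
function pickModel(task: Task, opts: RoutingOptions = {}): string {
  const longContext = (opts.inputTokens ?? 0) > 10_000;
  const needsPremium =
    PREMIUM_TASKS.has(task) || longContext || opts.highStakes === true;
  return needsPremium ? "gpt-4o" : "gpt-4o-mini";
}
```

Keeping the decision in one function means a later downgrade is a one-line change rather than a hunt through every call site.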

[Chart: Daily LLM spend (gpt-4o → gpt-4o-mini on day 14). Y-axis from $0 to $20. Annotation: downgrade saves $140/mo.]

The chart above shows what happens when one feature flips from gpt-4o to gpt-4o-mini on day 14. The monthly cost trajectory drops from roughly $480 to $34. Same feature, same output quality for the task it does.

Set per-user spending caps before someone burns your budget

The Uber lesson: without per-user caps, one bad actor (or one well-meaning power user) can spike your bill 10x in a week. You need a ceiling per user, per plan.

A cap has two possible actions when a user crosses it:

  • Block: the SDK throws an error before the call, and you catch it to show an upgrade prompt.
  • Downgrade: the SDK silently swaps to a cheaper model in the same provider, and the user notices nothing.

A downgrade path is what would have kept Uber’s situation from turning into an outage: they could have kept serving users on a lower model tier instead of cutting access entirely.
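The two cap actions can be sketched as a single decision function. This is an illustration of the idea, not Weckr's implementation: `PlanCap`, `CapExceededError`, `resolveModel`, and the `CHEAPER` mapping are hypothetical names, and the spend figure is assumed to come from your own usage tracking.

```typescript
type CapAction = "block" | "downgrade";

interface PlanCap {
  monthlyCapUsd: number;
  action: CapAction;
}

// Thrown when a capped user is on a plan whose action is "block".
class CapExceededError extends Error {
  constructor(userId: string) {
    super(`User ${userId} reached the monthly spending cap`);
    this.name = "CapExceededError";
  }
}

// Hypothetical mapping from a premium model to a cheaper one from the same provider.
const CHEAPER: Record<string, string> = { "gpt-4o": "gpt-4o-mini" };

// Decide which model a call should use, given what the user has spent this month.
function resolveModel(
  userId: string,
  spentUsd: number,
  requestedModel: string,
  cap: PlanCap,
): string {
  if (spentUsd < cap.monthlyCapUsd) return requestedModel; // under the cap
  if (cap.action === "block") throw new CapExceededError(userId);
  // Downgrade: fall back to the cheaper model, or keep the request if none is mapped.
  return CHEAPER[requestedModel] ?? requestedModel;
}
```

The check runs before the API call, so a blocked user costs nothing and a downgraded user keeps getting answers.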

Automatic downgrade in two lines

Weckr handles this with a wrapped client:

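import OpenAI from 'openai'
import { Weckr, WeckrCapError } from '@weckr/sdk'

// The OpenAI client reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI()

const wk = new Weckr({
  apiKey: process.env.WECKR_API_KEY,
  plans: { free: 0, starter: 9, pro: 29 },
})

// Example handler; the function name is illustrative.
async function handlePrompt(user, prompt) {
  try {
    // The cap for this user and plan is checked before the call is made.
    const result = await wk.chat(openai, {
      model: 'gpt-4o',
      messages: [{ role: 'user', content: prompt }],
      userId: user.id,
      plan: user.plan,
    })
    return { result }
  } catch (err) {
    // Cap reached with the block action: show an upgrade prompt.
    if (err instanceof WeckrCapError) {
      return { error: 'Usage limit reached. Upgrade to keep going.' }
    }
    throw err
  }
}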

Configure the cap in the dashboard (for example, free plan: $6/mo cap, action downgrade). When a free user hits $6 in API spend, the SDK automatically switches their next call to gpt-4o-mini.

They keep using the product, and you stop bleeding money. Full setup at useweckr.com/docs.

A worked example: cutting $108 a month on one feature

Pretend you run an AI writing tool. The main “improve my paragraph” feature uses gpt-4o, average call is 350 input tokens and 200 output tokens, and you handle about 40,000 of these per month.

Monthly gpt-4o cost: (350 × $2.50 + 200 × $10) / 1,000,000 × 40,000 = $115. Add three sibling features and you are at $480/month for one product surface.

You measure the output of “improve my paragraph” on both models. You can’t tell them apart in blind A/B tests, so you flip the default to gpt-4o-mini.

New cost: (350 × $0.15 + 200 × $0.60) / 1,000,000 × 40,000 = $7. You saved $108 on one feature, and roughly $140 to $300/month across the surface. Compounded over a year, that pays for a junior contractor.
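The arithmetic above is simple enough to keep as a helper so you can rerun it for each feature. The sketch below reproduces the two calculations. The `Price` type and `monthlyCost` name are illustrative, and the prices are the per-million-token figures used in this article.

```typescript
// USD per 1M tokens, as used in the calculation above.
interface Price {
  input: number;
  output: number;
}

const GPT_4O: Price = { input: 2.5, output: 10 };
const GPT_4O_MINI: Price = { input: 0.15, output: 0.6 };

// Monthly cost in USD for a feature with a fixed average call size.
function monthlyCost(
  price: Price,
  inputTokens: number,
  outputTokens: number,
  callsPerMonth: number,
): number {
  const perCall =
    (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
  return perCall * callsPerMonth;
}

const costBefore = monthlyCost(GPT_4O, 350, 200, 40_000);
const costAfter = monthlyCost(GPT_4O_MINI, 350, 200, 40_000);

// Prints: 115.00 6.90 108.10
console.log(
  costBefore.toFixed(2),
  costAfter.toFixed(2),
  (costBefore - costAfter).toFixed(2),
);
```

The unrounded saving is $108.10; the $108 above comes from rounding $6.90 up to $7 first.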

FAQ

How do I reduce my OpenAI API bill?

Three things work: use cheaper models for simple tasks, set spending caps per user so heavy users do not blow your budget, and detect agent loops before they run thousands of unnecessary calls. Weckr automates all three.

What is the cheapest OpenAI model for a SaaS app?

GPT-4o-mini costs $0.15 per million input tokens and $0.60 per million output tokens, making it the best default for most SaaS use cases. Reserve GPT-4o for complex tasks only.

How do I automatically switch to a cheaper model when costs get too high?

Set a spending cap in Weckr with action set to downgrade. When a user hits their monthly cap, Weckr automatically routes their calls to gpt-4o-mini instead of gpt-4o. No code change needed.

What is an AI agent reasoning loop and how do I detect it?

A reasoning loop is when an AI agent repeatedly calls the LLM with similar prompts without making progress, burning tokens and money. Weckr detects this by monitoring token velocity. If a single user session exceeds 50,000 tokens in 5 minutes, it fires a Slack alert immediately.
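To make the token-velocity idea concrete, here is a minimal sliding-window sketch using the thresholds from the answer above (50,000 tokens in 5 minutes). It is not Weckr's implementation. The `VelocityMonitor` class and its `record` method are hypothetical, state is kept in memory only, and sending the alert is left to the caller.

```typescript
// Thresholds from the answer above.
const WINDOW_MS = 5 * 60 * 1000;
const TOKEN_LIMIT = 50_000;

interface TokenEvent {
  at: number;     // timestamp in epoch milliseconds
  tokens: number; // tokens used by one call
}

class VelocityMonitor {
  private events = new Map<string, TokenEvent[]>();

  // Record one call. Returns true when the session's token total inside the
  // sliding window exceeds the limit, which is the signal to fire an alert.
  record(sessionId: string, tokens: number, now: number = Date.now()): boolean {
    const recent = (this.events.get(sessionId) ?? []).filter(
      (e) => now - e.at < WINDOW_MS,
    );
    recent.push({ at: now, tokens });
    this.events.set(sessionId, recent);
    const total = recent.reduce((sum, e) => sum + e.tokens, 0);
    return total > TOKEN_LIMIT;
  }
}
```

Call `record` after every completion with the token count from the response. A `true` result means the session crossed the threshold and is worth an alert or a pause.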

How much can I save by switching from GPT-4o to GPT-4o-mini?

GPT-4o costs $2.50 per million input tokens versus $0.15 for GPT-4o-mini, and $10 versus $0.60 per million output tokens, which makes GPT-4o-mini roughly 16x cheaper. For a feature making 40,000 calls per month with an average of 350 input tokens and 200 output tokens, the saving is around $108 per month.

Stop letting the OpenAI bill set itself

The companies losing this game all have one thing in common. They treat the OpenAI invoice as an external force. It isn’t.

You decide which model gets called for what, at what limit. Pick the right model per task, set per-user caps, get visibility per feature. See it working with real data at useweckr.com/demo.
