How to Set Spending Caps on LLM API Calls Per User

Why spending caps matter

Traditional SaaS pricing assumes your cost per customer is roughly fixed. A seat is a seat. AI broke that assumption. You collect a flat subscription price, but the cost side is variable, driven entirely by how many tokens each user pushes through the model.

The gap between those two lines is where your margin lives, and it is not evenly distributed. Most users are cheap. A handful are not. On gpt-4o at $2.50 per million input tokens and $10 per million output tokens, a user running a chat-heavy workflow all day can rack up tens of dollars a month against a $49 plan. Ten of them and you are underwater on a plan that looked healthy in the spreadsheet.
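To make that concrete, here is a rough back-of-the-envelope calculation. The call volume and token counts per call are illustrative assumptions, not measurements; only the gpt-4o prices come from the figures above.

```typescript
// gpt-4o prices quoted above, in USD per million tokens.
const INPUT_USD_PER_MILLION = 2.5
const OUTPUT_USD_PER_MILLION = 10

// Cost of a single call given its token counts.
function callCostUsd(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens * INPUT_USD_PER_MILLION +
      outputTokens * OUTPUT_USD_PER_MILLION) /
    1_000_000
  )
}

// Assumed heavy user (hypothetical numbers): 400 calls a day for 30 days,
// each call with 2,000 input tokens and 500 output tokens.
const perCall = callCostUsd(2_000, 500)
const monthlyCost = perCall * 400 * 30

console.log(`per call: $${perCall.toFixed(4)}`) // per call: $0.0100
console.log(`per month: $${monthlyCost.toFixed(2)}`) // per month: $120.00
```

Under these assumed numbers, one such user costs about $120 a month against a $49 plan. Lighter usage lands well below that, which is exactly why the cost is so unevenly spread.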

You can raise prices, but that punishes the 90 percent of users who cost you almost nothing. A per-user spending cap is the surgical option: leave the price alone, and put a limit on the specific users who would otherwise run away with your margin.

How to implement a manual spending cap in code

The naive version is straightforward. Track how much each user has spent this month, check that number before every call, and refuse the call if they are over. Here is the shape of it in TypeScript:

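import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

type ChatParams = OpenAI.Chat.ChatCompletionCreateParamsNonStreaming

// Monthly USD ceiling per plan
const CAPS: Record<string, number> = { free: 0, pro: 49 }
// userId -> USD spent this month (held in this process's memory only)
const spendThisMonth = new Map<string, number>()

async function chatWithCap(userId: string, plan: string, params: ChatParams) {
  const spent = spendThisMonth.get(userId) ?? 0
  const cap = CAPS[plan] ?? 0 // unknown plans get no AI budget

  if (spent >= cap) {
    throw new Error('AI spending cap reached')
  }

  const response = await openai.chat.completions.create(params)

  // Priced at gpt-4o rates ($2.50 in, $10 out per million tokens),
  // whatever model params.model actually names.
  const cost =
    ((response.usage?.prompt_tokens ?? 0) * 2.5 +
      (response.usage?.completion_tokens ?? 0) * 10) /
    1_000_000

  spendThisMonth.set(userId, spent + cost)
  return response
}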

This works on your laptop. You call it, the counter climbs, and once a user crosses their cap the throw fires and the request is blocked. For a demo or a single-box side project, it is genuinely fine.

The problem with manual caps

The trouble starts the moment you run more than one copy of your server, which is to say the moment you deploy to anything real.

That spendThisMonth map lives in one process’s memory. Run three instances behind a load balancer and you have three separate counters that never talk to each other. A user whose requests get spread across all three sees their effective cap roughly triple. Add a second region and it gets worse. Deploy a new version and every counter resets to zero, so everyone’s spend silently starts over mid-month.

There is a concurrency problem hiding in there too. Two requests from the same user can both read the old spend, both pass the check, and both fire before either writes the new total back. Under load the cap becomes a suggestion.
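The race is easy to reproduce in a single process. The sketch below follows the same check, call, write-back pattern; `fakeLlmCall`, the $1 cost, and the $1 cap are hypothetical stand-ins so it runs without an API key.

```typescript
// Simulated provider call: takes 50 ms and always costs $1.
const fakeLlmCall = () =>
  new Promise<number>((resolve) => setTimeout(() => resolve(1), 50))

const CAP_USD = 1
const spend = new Map<string, number>()

// Same check-then-call-then-write pattern as chatWithCap above.
async function naiveCall(userId: string): Promise<boolean> {
  const spent = spend.get(userId) ?? 0
  if (spent >= CAP_USD) return false // blocked
  const cost = await fakeLlmCall() // other requests run during this await
  spend.set(userId, spent + cost) // writes back a total based on a stale read
  return true
}

// Three requests arrive at once for one user with a $1 cap.
async function demo(userId: string) {
  const results = await Promise.all([
    naiveCall(userId),
    naiveCall(userId),
    naiveCall(userId),
  ])
  return {
    allowed: results.filter(Boolean).length,
    recorded: spend.get(userId),
  }
}

demo('u1').then(console.log) // { allowed: 3, recorded: 1 }
```

All three calls pass the check, so the user actually spends $3 against a $1 cap, and the counter records only $1 because each write overwrites the others.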

And notice what this design can’t do: there is no graceful path. It is a hard wall. When the user hits the cap, the feature just breaks. You have no built-in way to say “keep working, but on the cheaper model,” which is usually what you actually want, because a downgraded answer beats an error message for a paying customer.

You can fix all of this. Move the counter into Redis, add atomic increments, build a per-plan config, wire up a model fallback table, handle the month boundary. It is a real project, and it is a project you will be maintaining instead of building your product.
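As a rough idea of what the first pieces involve, here is a minimal sketch of a shared counter with a month-scoped key and an atomic reserve-then-settle step. The `SpendStore` interface, `InMemoryStore`, `spendKey`, and `callWithSharedCap` are hypothetical names for illustration; with Redis, `incrByFloat` would map to the INCRBYFLOAT command, and the in-memory class only stands in so the sketch runs without a server.

```typescript
// Hypothetical minimal store interface; returns the new total after the increment.
interface SpendStore {
  incrByFloat(key: string, amount: number): Promise<number>
}

// In-memory stand-in for a shared store such as Redis.
class InMemoryStore implements SpendStore {
  private totals = new Map<string, number>()

  async incrByFloat(key: string, amount: number): Promise<number> {
    const next = (this.totals.get(key) ?? 0) + amount
    this.totals.set(key, next)
    return next
  }
}

// One key per user per calendar month (UTC), so a new month starts at zero
// without a separate reset job.
function spendKey(userId: string, now: Date): string {
  const month = String(now.getUTCMonth() + 1).padStart(2, '0')
  return `spend:${userId}:${now.getUTCFullYear()}-${month}`
}

// Reserve an estimated cost before the call, then settle to the actual cost.
// Resolves to false when the reservation would push the user over the cap.
async function callWithSharedCap(
  store: SpendStore,
  userId: string,
  capUsd: number,
  estimateUsd: number,
  call: () => Promise<number>, // resolves to the actual cost in USD
  now: Date = new Date(),
): Promise<boolean> {
  const key = spendKey(userId, now)

  const afterReserve = await store.incrByFloat(key, estimateUsd)
  if (afterReserve > capUsd) {
    await store.incrByFloat(key, -estimateUsd) // roll back the reservation
    return false
  }

  const actualUsd = await call().catch(async (err) => {
    await store.incrByFloat(key, -estimateUsd) // release it if the call fails
    throw err
  })

  await store.incrByFloat(key, actualUsd - estimateUsd) // settle the difference
  return true
}
```

Because the increment and the read of the new total happen in one atomic step, concurrent requests cannot all see the same old balance. The trade-off is that you need a cost estimate before the call, and the per-plan config and model fallback are still left to build.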

How Weckr spending caps work

Weckr moves the whole thing behind the same wrapper you use for cost tracking. You declare your plans once, then pass a userId and plan on each call, and the SDK enforces the cap for you:

import { Weckr } from '@weckr/sdk'

const wk = new Weckr({
  apiKey: process.env.WECKR_API_KEY,
  plans: { free: 0, pro: 49 },
})

const response = await wk.chat(openai, {
  model: 'gpt-4o',
  messages: [{ role: 'user', content: prompt }],
  userId: user.id,
  feature: 'chat',
  plan: user.plan,
})

Every cap has one of two actions, block or downgrade, and you choose one per plan in the dashboard.

Action: block

When a user on a block cap goes over their monthly limit, the SDK throws a WeckrCapError. You catch it and show whatever makes sense for your product: an upgrade prompt, a “you’ve hit this month’s limit” notice, a link to the pricing page.

import { Weckr, WeckrCapError } from '@weckr/sdk'

try {
  const response = await wk.chat(openai, {
    model: 'gpt-4o',
    messages,
    userId: user.id,
    feature: 'chat',
    plan: user.plan,
  })
  return response
} catch (err) {
  if (err instanceof WeckrCapError) {
    return { upgrade: true, message: 'You have reached this month\'s AI limit.' }
  }
  throw err
}

Action: downgrade

A downgrade cap never throws. When the user is over their limit, Weckr silently swaps the requested model for a cheaper sibling in the same provider and lets the call go through. You asked for gpt-4o, they get gpt-4o-mini. The response comes back in exactly the same shape, so nothing downstream needs to change. Some of the built-in mappings:

// requested model  ->  downgraded model (same provider)
gpt-4o           ->  gpt-4o-mini
claude-opus-4    ->  claude-sonnet-4
gemini-2.5-pro   ->  gemini-2.5-flash

The cost difference is real. gpt-4o runs $2.50 and $10 per million tokens; gpt-4o-mini is $0.15 and $0.60. For a user who has already blown through their cap, that is the difference between bleeding money on every call and staying roughly break-even while they keep using the feature.
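A quick calculation shows the size of the gap. The prices are the ones quoted above; the call size is an assumed example, and `costUsd` is just an illustrative helper.

```typescript
// USD per million tokens, as quoted above.
const PRICES = {
  'gpt-4o': { input: 2.5, output: 10 },
  'gpt-4o-mini': { input: 0.15, output: 0.6 },
} as const

type PricedModel = keyof typeof PRICES

function costUsd(
  model: PricedModel,
  inputTokens: number,
  outputTokens: number,
): number {
  const price = PRICES[model]
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000
}

// Assumed call size (hypothetical): 2,000 input tokens, 500 output tokens.
const full = costUsd('gpt-4o', 2_000, 500)
const mini = costUsd('gpt-4o-mini', 2_000, 500)

console.log(full.toFixed(4), mini.toFixed(4), (full / mini).toFixed(1))
// 0.0100 0.0006 16.7
```

The input and output prices differ by the same factor (2.50 / 0.15 and 10 / 0.60 are both about 16.7), so the downgraded call costs roughly one seventeenth as much whatever the mix of input and output tokens.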

The check is fast, and it fails open

The obvious worry with any cap is latency. If Weckr had to phone home before every single call, that would be a round trip on your hot path. It doesn’t. The SDK checks a user’s cap status and then caches that status for 60 seconds, so in the worst case a given user triggers at most one extra check per minute. Everything in that window reuses the cached result.

Just as important, the cap fails open. If Weckr is unreachable, the SDK does not block your users; the LLM call goes through as normal. A cost-control feature should never be able to take down the product it is protecting, so an outage on our side degrades to “no cap enforced right now” rather than “nobody can use your app.”
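The general pattern behind those two behaviors, a short-lived cache plus a fail-open fallback, can be sketched in a few lines. This is an illustration of the idea only, not the Weckr SDK's actual implementation; `CapStatusCache`, `CapStatus`, and the injected `fetchStatus` function are hypothetical.

```typescript
type CapStatus = 'ok' | 'over'

class CapStatusCache {
  private entries = new Map<string, { status: CapStatus; expiresAt: number }>()
  private fetchStatus: (userId: string) => Promise<CapStatus>
  private ttlMs: number
  private now: () => number

  constructor(
    fetchStatus: (userId: string) => Promise<CapStatus>, // the remote check
    ttlMs = 60_000, // 60 seconds, matching the window described above
    now: () => number = () => Date.now(), // injectable clock for testing
  ) {
    this.fetchStatus = fetchStatus
    this.ttlMs = ttlMs
    this.now = now
  }

  async get(userId: string): Promise<CapStatus> {
    const hit = this.entries.get(userId)
    if (hit && hit.expiresAt > this.now()) return hit.status // served from cache

    try {
      const status = await this.fetchStatus(userId)
      this.entries.set(userId, { status, expiresAt: this.now() + this.ttlMs })
      return status
    } catch {
      // Fail open: if the check cannot complete, let the call through.
      // The failure is not cached, so the next call retries the check.
      return 'ok'
    }
  }
}
```

The cost of the cache is staleness: a user who crosses the cap can keep calling at full price until their cached status expires, which is the trade made to keep the check off the hot path.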

Setting caps in the dashboard without code

The caps themselves are not baked into your codebase. Each project has per-plan monthly USD caps in the dashboard settings, so the person tuning them does not have to be the person who ships deploys.

That matters because the people who care most about margin, the founder and the PM, are often not the ones touching the LLM code. In the dashboard they set the free tier to a low ceiling and pro to a higher one, choose block or downgrade for each, and save. The change takes effect on the next cap check, which is at most a minute out because of the 60-second cache. No pull request, no deploy, no waiting on an engineer.

So the split is clean. Your code passes a userId and a plan on every call and stays the same forever. The actual numbers, the dollar ceilings and the block-versus-downgrade choice, live in settings where whoever owns pricing can adjust them the moment the data says a plan is underwater.

FAQ

How do I limit how much each user can spend on AI calls?

Set a monthly USD cap per user and check the running spend before every LLM call. If the user is over the cap, you either block the call or route it to a cheaper model. The Weckr SDK does this for you: you pass a userId and a plan on each call and it enforces the cap you configured for that plan.

What is the best way to enforce LLM budget limits per customer?

Enforce the cap server-side, right before the provider call, using spend that is tracked in a shared store rather than in each server process. In-memory counters are not shared across instances or regions and reset on every deploy, so a user can blow past the limit. Weckr keeps the per-user spend centrally and checks it before each call, with a 60-second cache so it stays fast.

What happens when a user hits their AI spending cap?

It depends on the action you choose. With the block action the SDK throws a WeckrCapError that you catch and turn into an upgrade prompt or a friendly limit message. With the downgrade action the call still goes through, but on a cheaper model in the same provider, so the user keeps working and your cost drops.

Can I automatically downgrade to a cheaper model when a user hits their cap?

Yes. Set the cap action to downgrade and Weckr silently swaps the requested model for a cheaper sibling in the same provider, for example gpt-4o to gpt-4o-mini or claude-opus-4 to claude-sonnet-4. Your code does not change and the response comes back in the same shape. The user keeps the feature, just at a lower cost per token.

How do I set different spending limits for different subscription plans?

Give each plan its own monthly USD cap. In Weckr you configure per-plan caps in the project dashboard settings, and pass the user’s plan on each call so the SDK knows which limit applies. You can set the free tier to a low cap and pro to a higher one, and change either without shipping code.

Cap the runaway users, keep the profitable ones

A flat price against a variable cost only works if you put a ceiling on the worst case. A per-user spending cap is that ceiling: it lets the 90 percent of customers who cost you almost nothing keep their full experience, while the one user running 400 calls a day either gets an upgrade prompt or quietly moves to the cheaper model. You keep the good margin and stop subsidizing the bad. Caps are one tactic among several ways to reduce your OpenAI costs. See caps, blocks, and downgrades running on seeded data, no signup needed, at useweckr.com/demo.