OpenAI's rate limits are per-organization, per-model, enforced on requests-per-minute (RPM) and tokens-per-minute (TPM). The patterns that work in production: a single shared token bucket per key, exponential backoff with jitter that respects retry-after, careful retry classification, and capacity planning against your tier. Model fallback exists, but use it as a last resort.
Every AI app eventually meets the same wall: 429s in production. They show up gradually — one a day, then ten, then a thousand — and the fix is never just 'add a retry.' This post walks through what OpenAI's rate limits actually do, why naive retries make things worse, and the patterns that work at scale.
The OpenAI rate-limit model
OpenAI's rate limits operate at the organization level, separately for each model, with two simultaneous dimensions:
- RPM (requests per minute). A simple count of API calls.
- TPM (tokens per minute). The total prompt + completion tokens across all calls.
You hit a 429 the moment either limit is exceeded. Your tier determines the ceilings: as your usage and payment history grow, OpenAI moves your org up the tier ladder and raises both. You can check your current limits in the OpenAI dashboard.
TPM counts both the input tokens you send and the output tokens the model returns. The output is unknown at send time, so OpenAI reserves capacity based on your `max_tokens` parameter. If you set `max_tokens: 4096` on every call, you'll hit TPM far sooner than if you cap it at what you actually need. Tighten this aggressively — it's the cheapest improvement most teams can make.
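For intuition, here's the back-of-envelope math. The numbers are illustrative, assuming roughly prompt tokens plus `max_tokens` are reserved per request:

```typescript
// Illustrative: how much TPM the same traffic reserves at two max_tokens
// settings (assumption: roughly prompt tokens + max_tokens per request).
const promptTokens = 500;
const requestsPerMinute = 100;

const reservedLoose = requestsPerMinute * (promptTokens + 4096); // 459,600 TPM
const reservedTight = requestsPerMinute * (promptTokens + 512);  // 101,200 TPM

console.log({ reservedLoose, reservedTight }); // same traffic, ~4.5x less reserved
```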
What to read from a 429 response
When OpenAI throttles you, the response includes useful metadata. The most important headers:
| Header | What it tells you |
|---|---|
| x-ratelimit-limit-requests | Your RPM ceiling for this model |
| x-ratelimit-remaining-requests | Requests remaining in current window |
| x-ratelimit-reset-requests | Duration until RPM window resets |
| x-ratelimit-limit-tokens | Your TPM ceiling for this model |
| x-ratelimit-remaining-tokens | Tokens remaining in current window |
| x-ratelimit-reset-tokens | Duration until TPM window resets |
| retry-after-ms | On 429s: minimum wait before retrying (ms) |
In a healthy app, you read these on every successful response, not just on 429s. They let you predict throttling before it happens — and start slowing down preemptively, instead of crashing into a wall.
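A minimal sketch of that habit, assuming a fetch-style `Response`; the 10% threshold and the `onLowCapacity` hook are placeholders you'd wire into your own dispatcher:

```typescript
// Sketch: read rate-limit headers on every response and slow down
// preemptively, before OpenAI starts returning 429s.
export function inspectRateHeaders(
  res: Response,
  onLowCapacity: (resetHint: string | null) => void
): void {
  const remainingRequests = Number(res.headers.get("x-ratelimit-remaining-requests"));
  const limitRequests = Number(res.headers.get("x-ratelimit-limit-requests"));
  const remainingTokens = Number(res.headers.get("x-ratelimit-remaining-tokens"));
  const limitTokens = Number(res.headers.get("x-ratelimit-limit-tokens"));

  // Headroom on each dimension; treat missing headers as full headroom.
  const requestHeadroom = limitRequests ? remainingRequests / limitRequests : 1;
  const tokenHeadroom = limitTokens ? remainingTokens / limitTokens : 1;

  // Under 10% headroom on either dimension: start throttling yourself now.
  if (Math.min(requestHeadroom, tokenHeadroom) < 0.1) {
    onLowCapacity(res.headers.get("x-ratelimit-reset-tokens"));
  }
}
```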
Why naive retries make things worse
The most common mistake is per-worker retry loops. Imagine four workers, each retrying their own 429s with exponential backoff. When the rate limit refreshes, all four resume at once, immediately re-saturate the limit, and trigger another wave of 429s. You haven't fixed throttling — you've turned it into a sawtooth.
The fix is to coordinate. There are two viable patterns:
- Shared token bucket. All workers consume from one Redis-backed bucket keyed by upstream name (e.g. `openai-gpt-4o-mini-prod`). The bucket refills at the rate OpenAI allows; workers wait for tokens before sending requests.
- Single dispatcher. One process owns the rate budget and dispatches work to a worker pool. Workers don't make API calls directly.
Both work; shared token buckets scale better because there's no single dispatcher to bottleneck on. This is what SimpleQ does internally — every job with a given rateLimitKey consumes from the same bucket, regardless of how many workers you're running.
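If you're building the bucket yourself, a minimal Redis version looks roughly like the sketch below. It assumes ioredis and makes refill-and-take atomic with a small Lua script; the expiry and polling interval are choices you'd tune:

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Atomic token bucket in Lua: refill based on elapsed time, then try to
// take one token. Returns 1 if a token was granted, 0 otherwise.
const TAKE = `
  local key = KEYS[1]
  local rate = tonumber(ARGV[1])   -- tokens per second
  local burst = tonumber(ARGV[2])  -- bucket capacity
  local now = tonumber(ARGV[3])    -- current time in seconds

  local state = redis.call('HMGET', key, 'tokens', 'ts')
  local tokens = tonumber(state[1]) or burst
  local ts = tonumber(state[2]) or now

  tokens = math.min(burst, tokens + (now - ts) * rate)
  local granted = 0
  if tokens >= 1 then
    tokens = tokens - 1
    granted = 1
  end
  redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
  redis.call('EXPIRE', key, 120)
  return granted
`;

// Block until the shared bucket grants a token for this upstream.
export async function waitForToken(key: string, rate: number, burst: number) {
  for (;;) {
    const granted = await redis.eval(TAKE, 1, key, rate, burst, Date.now() / 1000);
    if (granted === 1) return;
    // Poll again after roughly one refill interval.
    await new Promise((r) => setTimeout(r, 1000 / rate));
  }
}
```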
Exponential backoff with jitter
When you do need to retry, the standard pattern is exponential backoff with full jitter:
```typescript
function nextDelayMs(attempt: number, opts: {
  initialMs: number;
  maxMs: number;
}): number {
  const exp = Math.min(opts.maxMs, opts.initialMs * 2 ** attempt);
  // Full jitter: uniformly random in [0, exp]
  return Math.floor(Math.random() * exp);
}

// Example: attempt 0 → 0-1000ms, attempt 1 → 0-2000ms, attempt 5 → 0-32000ms (capped)
```

Jitter is critical. Without it, every retrying client wakes up at the same time and reproduces the storm. Full jitter (uniform in [0, exp]) is provably as good or better than partial jitter for this case — see the AWS Architecture Blog's seminal post on the subject.
When OpenAI sends a retry-after-ms, use it as a floor — don't retry sooner. But still add some jitter on top to avoid thundering-herd resumes when many requests have been throttled at the same time.
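Put together, a sketch that reuses nextDelayMs from above; the one-second jitter cap is an arbitrary choice:

```typescript
// Sketch: honor retry-after-ms as a floor, with jitter layered on top so
// throttled clients don't all resume at the same instant.
function throttledDelayMs(attempt: number, res: Response): number {
  const backoff = nextDelayMs(attempt, { initialMs: 1000, maxMs: 32000 });
  const retryAfterMs = Number(res.headers.get("retry-after-ms"));
  if (!Number.isFinite(retryAfterMs) || retryAfterMs <= 0) return backoff;
  // Never wait less than the server's floor; add up to 1s of extra spread.
  const jitterMs = Math.floor(Math.random() * 1000);
  return Math.max(backoff, retryAfterMs + jitterMs);
}
```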
Retry classification
Not every error should retry. A short classifier:
| Status / Error | Action |
|---|---|
| 429 (rate limit) | Retry with backoff, respect retry-after |
| 500, 502, 503, 504 | Retry with backoff (capped) |
| 408, ETIMEDOUT, ECONNRESET | Retry with backoff |
| 400 (bad request) | Do not retry — fix the input |
| 401 (auth) | Do not retry — fix the key |
| 403 (content policy) | Do not retry — review the prompt |
| 404 (not found) | Do not retry — fix the model name |
Retrying on a 400 is a bug, not a fix. Retrying on a 403 (content policy violation) can get your account flagged. The classifier is one of the most valuable ~20 lines you can write.
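Here's one way to sketch that classifier, mirroring the table above; the error-code branch assumes Node-style network errors:

```typescript
type RetryDecision = { retry: boolean; reason: string };

// The ~20-line classifier: retry only what can plausibly succeed on retry.
export function classify(status: number | null, code?: string): RetryDecision {
  if (code === "ETIMEDOUT" || code === "ECONNRESET") {
    return { retry: true, reason: "transient network error" };
  }
  switch (status) {
    case 429:
      return { retry: true, reason: "rate limited; respect retry-after" };
    case 408:
    case 500:
    case 502:
    case 503:
    case 504:
      return { retry: true, reason: "transient server error" };
    case 400:
      return { retry: false, reason: "bad request; fix the input" };
    case 401:
      return { retry: false, reason: "auth failure; fix the key" };
    case 403:
      return { retry: false, reason: "content policy; review the prompt" };
    case 404:
      return { retry: false, reason: "not found; fix the model name" };
    default:
      return { retry: false, reason: "unclassified; fail loudly" };
  }
}
```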
Model fallback — a last resort
When sustained capacity is unavailable on your primary model (e.g., the bucket has been empty for over a minute), some teams fall back to a smaller or different model. This is a trade-off:
- Pro: the user gets some answer instead of a timeout.
- Con: quality varies, which can be worse for some use cases than failing loudly.
- Con: fallback paths are usually less-tested and accumulate their own bugs.
Use fallback for non-critical features (recommendations, summaries) where degraded quality beats no response. Skip it for evaluation-sensitive flows where consistent output matters.
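If you do add a fallback, gate it explicitly rather than baking it into every call path. A sketch, where bucketEmptyForMs is a hypothetical helper over your bucket state and the model names are just examples:

```typescript
// Hypothetical helper: how long the primary model's bucket has been empty.
declare function bucketEmptyForMs(key: string): Promise<number>;

// Fall back only for flagged non-critical features, and only after
// sustained starvation on the primary model.
async function pickModel(feature: { critical: boolean }): Promise<string> {
  const primaryStarvedMs = await bucketEmptyForMs("openai-gpt-4o-prod");
  if (!feature.critical && primaryStarvedMs > 60_000) {
    return "gpt-4o-mini"; // degraded but responsive
  }
  return "gpt-4o";
}
```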
Capacity planning
Reactive retries are a tactic; capacity planning is the strategy. Three things to track weekly:
1. Your peak RPM and peak TPM on each model, with a 95th-percentile margin.
2. Your tier ceilings for those models. Set alerts when you're consistently within 70% of either.
3. Your `max_tokens` settings. Find the prompts that reserve significantly more output than they use, and tighten them.
Once you're approaching a ceiling, request a tier bump (OpenAI is generally responsive if you can show usage). And consider splitting workloads across multiple OpenAI organizations to multiply your effective ceiling.
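One simple shape for that weekly check; field names are illustrative, fed from your metrics store and the OpenAI dashboard:

```typescript
// Illustrative weekly headroom check against the 70% alert line.
interface ModelUsage {
  model: string;
  peakRpm: number;    // 95th-percentile weekly peak
  peakTpm: number;
  rpmCeiling: number; // from the OpenAI dashboard for your tier
  tpmCeiling: number;
}

function needsTierBump(u: ModelUsage): boolean {
  const rpmUtilization = u.peakRpm / u.rpmCeiling;
  const tpmUtilization = u.peakTpm / u.tpmCeiling;
  // Alert when consistently within 70% of either ceiling.
  return Math.max(rpmUtilization, tpmUtilization) > 0.7;
}
```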
Putting it all together
A production-grade OpenAI client looks roughly like this — shared bucket, classifier, backoff, capped attempts:
```typescript
import { SimpleQ } from "@simpleq/sdk";

const simpleq = new SimpleQ({ apiKey: process.env.SIMPLEQ_API_KEY });

// Configure once
await simpleq.rateLimits.set("openai-gpt-4o-mini-prod", {
  requestsPerSecond: 50,
  burst: 100
});

// Every call goes through the queue
export async function chat(input: string, userId: string) {
  return simpleq.enqueue({
    queue: "ai-jobs",
    type: "openai.chat",
    idempotencyKey: `chat_${userId}_${hash(input)}`,
    payload: {
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: input }],
      max_tokens: 512
    },
    retry: {
      maxAttempts: 6,
      backoff: "exponential",
      jitter: "full"
    },
    rateLimitKey: "openai-gpt-4o-mini-prod"
  });
}
```

That's the whole pattern. The queue handles retry/backoff/classifier; the rate-limit key handles capacity coordination across your fleet; the idempotency key prevents duplicate work on retries.
If you want this without building the bucket yourself, SimpleQ implements the full pattern — rate-limit keys, retry classifiers, dead-letter queues, and per-job observability. The free tier covers 10,000 executions a month. See the AI use case for an end-to-end example.