OpenAI's rate limits are per-organization, per-model, enforced on requests-per-minute (RPM) and tokens-per-minute (TPM). The patterns that work in production: a single shared token bucket per key, exponential backoff with jitter that respects retry-after, careful retry classification, and capacity planning against your tier. Model fallback exists, but use it as a last resort.
Every AI app eventually meets the same wall: 429s in production. They show up gradually — one a day, then ten, then a thousand — and the fix is never just 'add a retry.' This post walks through what OpenAI's rate limits actually do, why naive retries make things worse, and the patterns that work at scale.
The OpenAI rate-limit model
OpenAI's rate limits operate at the organization level, separately for each model, with two simultaneous dimensions:
- RPM (requests per minute). A simple count of API calls.
- TPM (tokens per minute). The total prompt + completion tokens across all calls.
You hit a 429 the moment either limit is exceeded. Your tier determines the ceilings: as your usage and payment history grow, OpenAI moves your org up the tier ladder and raises both. You can check your current limits in the OpenAI dashboard.
TPM counts both the input tokens you send and the output tokens the model returns. The output is unknown at send time, so OpenAI reserves capacity based on your `max_tokens` parameter. If you set `max_tokens: 4096` on every call, you'll hit TPM far sooner than if you cap it at what you actually need. Tighten this aggressively — it's the cheapest improvement most teams can make.
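For intuition, here's the back-of-envelope math. The numbers are illustrative, assuming roughly prompt tokens plus `max_tokens` are reserved per request:

```typescript
// Illustrative: how much TPM the same traffic reserves at two max_tokens
// settings (assumption: roughly prompt tokens + max_tokens per request).
const promptTokens = 500;
const requestsPerMinute = 100;

const reservedLoose = requestsPerMinute * (promptTokens + 4096); // 459,600 TPM
const reservedTight = requestsPerMinute * (promptTokens + 512);  // 101,200 TPM

console.log({ reservedLoose, reservedTight }); // same traffic, ~4.5x less reserved
```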
What to read from a 429 response
When OpenAI throttles you, the response includes useful metadata. The most important headers:
| Header | What it tells you |
|---|---|
| x-ratelimit-limit-requests | Your RPM ceiling for this model |
| x-ratelimit-remaining-requests | Requests remaining in current window |
| x-ratelimit-reset-requests | Duration until RPM window resets |
| x-ratelimit-limit-tokens | Your TPM ceiling for this model |
| x-ratelimit-remaining-tokens | Tokens remaining in current window |
| x-ratelimit-reset-tokens | Duration until TPM window resets |
| retry-after-ms | On 429s: minimum wait before retrying (ms) |
In a healthy app, you read these on every successful response, not just on 429s. They let you predict throttling before it happens — and start slowing down preemptively, instead of crashing into a wall.
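A minimal sketch of that habit, assuming a fetch-style `Response`; the 10% threshold and the `onLowCapacity` hook are placeholders you'd wire into your own dispatcher:

```typescript
// Sketch: read rate-limit headers on every response and slow down
// preemptively, before OpenAI starts returning 429s.
export function inspectRateHeaders(
  res: Response,
  onLowCapacity: (resetHint: string | null) => void
): void {
  const remainingRequests = Number(res.headers.get("x-ratelimit-remaining-requests"));
  const limitRequests = Number(res.headers.get("x-ratelimit-limit-requests"));
  const remainingTokens = Number(res.headers.get("x-ratelimit-remaining-tokens"));
  const limitTokens = Number(res.headers.get("x-ratelimit-limit-tokens"));

  // Headroom on each dimension; treat missing headers as full headroom.
  const requestHeadroom = limitRequests ? remainingRequests / limitRequests : 1;
  const tokenHeadroom = limitTokens ? remainingTokens / limitTokens : 1;

  // Under 10% headroom on either dimension: start throttling yourself now.
  if (Math.min(requestHeadroom, tokenHeadroom) < 0.1) {
    onLowCapacity(res.headers.get("x-ratelimit-reset-tokens"));
  }
}
```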
Why naive retries make things worse
The most common mistake is per-worker retry loops. Imagine four workers, each retrying their own 429s with exponential backoff. When the rate limit refreshes, all four resume at once, immediately re-saturate the limit, and trigger another wave of 429s. You haven't fixed throttling — you've turned it into a sawtooth.
The fix is to coordinate. There are two viable patterns:
- Shared token bucket. All workers consume from one Redis-backed bucket keyed by upstream name (e.g. `openai-gpt-4o-mini-prod`). The bucket refills at the rate OpenAI allows; workers wait for tokens before sending requests.
- Single dispatcher. One process owns the rate budget and dispatches work to a worker pool. Workers don't make API calls directly.
Both work; shared token buckets scale better because there's no single dispatcher to bottleneck on. This is what SimpleQ does internally — every job with a given rateLimitKey consumes from the same bucket, regardless of how many workers you're running.
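If you're building the bucket yourself, a minimal Redis version looks roughly like the sketch below. It assumes ioredis and makes refill-and-take atomic with a small Lua script; the expiry and polling interval are choices you'd tune:

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Atomic token bucket in Lua: refill based on elapsed time, then try to
// take one token. Returns 1 if a token was granted, 0 otherwise.
const TAKE = `
  local key = KEYS[1]
  local rate = tonumber(ARGV[1])   -- tokens per second
  local burst = tonumber(ARGV[2])  -- bucket capacity
  local now = tonumber(ARGV[3])    -- current time in seconds

  local state = redis.call('HMGET', key, 'tokens', 'ts')
  local tokens = tonumber(state[1]) or burst
  local ts = tonumber(state[2]) or now

  tokens = math.min(burst, tokens + (now - ts) * rate)
  local granted = 0
  if tokens >= 1 then
    tokens = tokens - 1
    granted = 1
  end
  redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
  redis.call('EXPIRE', key, 120)
  return granted
`;

// Block until the shared bucket grants a token for this upstream.
export async function waitForToken(key: string, rate: number, burst: number) {
  for (;;) {
    const granted = await redis.eval(TAKE, 1, key, rate, burst, Date.now() / 1000);
    if (granted === 1) return;
    // Poll again after roughly one refill interval.
    await new Promise((r) => setTimeout(r, 1000 / rate));
  }
}
```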
Exponential backoff with jitter
When you do need to retry, the standard pattern is exponential backoff with full jitter:
```typescript
function nextDelayMs(attempt: number, opts: {
  initialMs: number;
  maxMs: number;
}): number {
  const exp = Math.min(opts.maxMs, opts.initialMs * 2 ** attempt);
  // Full jitter: uniformly random in [0, exp]
  return Math.floor(Math.random() * exp);
}

// Example: attempt 0 → 0-1000ms, attempt 1 → 0-2000ms, attempt 5 → 0-32000ms (capped)
```

Jitter is critical. Without it, every retrying client wakes up at the same time and reproduces the storm. Full jitter (uniform in [0, exp]) is provably as good or better than partial jitter for this case — see the AWS Architecture Blog's seminal post on the subject.
When OpenAI sends a retry-after-ms, use it as a floor — don't retry sooner. But still add some jitter on top to avoid thundering-herd resumes when many requests have been throttled at the same time.
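Put together, a sketch that reuses nextDelayMs from above; the one-second jitter cap is an arbitrary choice:

```typescript
// Sketch: honor retry-after-ms as a floor, with jitter layered on top so
// throttled clients don't all resume at the same instant.
function throttledDelayMs(attempt: number, res: Response): number {
  const backoff = nextDelayMs(attempt, { initialMs: 1000, maxMs: 32000 });
  const retryAfterMs = Number(res.headers.get("retry-after-ms"));
  if (!Number.isFinite(retryAfterMs) || retryAfterMs <= 0) return backoff;
  // Never wait less than the server's floor; add up to 1s of extra spread.
  const jitterMs = Math.floor(Math.random() * 1000);
  return Math.max(backoff, retryAfterMs + jitterMs);
}
```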
Retry classification
Not every error should retry. A short classifier:
| Status / Error | Action |
|---|---|
| 429 (rate limit) | Retry with backoff, respect retry-after |
| 500, 502, 503, 504 | Retry with backoff (capped) |
| 408, ETIMEDOUT, ECONNRESET | Retry with backoff |
| 400 (bad request) | Do not retry — fix the input |
| 401 (auth) | Do not retry — fix the key |
| 403 (content policy) | Do not retry — review the prompt |
| 404 (not found) | Do not retry — fix the model name |
Retrying on a 400 is a bug, not a fix. Retrying on a 403 (content policy violation) can get your account flagged. The classifier is one of the most valuable ~20 lines you can write.
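Here's one way to sketch that classifier, mirroring the table above; the error-code branch assumes Node-style network errors:

```typescript
type RetryDecision = { retry: boolean; reason: string };

// The ~20-line classifier: retry only what can plausibly succeed on retry.
export function classify(status: number | null, code?: string): RetryDecision {
  if (code === "ETIMEDOUT" || code === "ECONNRESET") {
    return { retry: true, reason: "transient network error" };
  }
  switch (status) {
    case 429:
      return { retry: true, reason: "rate limited; respect retry-after" };
    case 408:
    case 500:
    case 502:
    case 503:
    case 504:
      return { retry: true, reason: "transient server error" };
    case 400:
      return { retry: false, reason: "bad request; fix the input" };
    case 401:
      return { retry: false, reason: "auth failure; fix the key" };
    case 403:
      return { retry: false, reason: "content policy; review the prompt" };
    case 404:
      return { retry: false, reason: "not found; fix the model name" };
    default:
      return { retry: false, reason: "unclassified; fail loudly" };
  }
}
```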
Model fallback — a last resort
When sustained capacity is unavailable on your primary model (e.g., the bucket has been empty for over a minute), some teams fall back to a smaller or different model. This is a trade-off:
- Pro: the user gets some answer instead of a timeout.
- Con: quality varies, which can be worse for some use cases than failing loudly.
- Con: fallback paths are usually less-tested and accumulate their own bugs.
Use fallback for non-critical features (recommendations, summaries) where degraded quality beats no response. Skip it for evaluation-sensitive flows where consistent output matters.
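If you do add a fallback, gate it explicitly rather than baking it into every call path. A sketch, where bucketEmptyForMs is a hypothetical helper over your bucket state and the model names are just examples:

```typescript
// Hypothetical helper: how long the primary model's bucket has been empty.
declare function bucketEmptyForMs(key: string): Promise<number>;

// Fall back only for flagged non-critical features, and only after
// sustained starvation on the primary model.
async function pickModel(feature: { critical: boolean }): Promise<string> {
  const primaryStarvedMs = await bucketEmptyForMs("openai-gpt-4o-prod");
  if (!feature.critical && primaryStarvedMs > 60_000) {
    return "gpt-4o-mini"; // degraded but responsive
  }
  return "gpt-4o";
}
```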
Capacity planning
Reactive retries are a tactic; capacity planning is the strategy. Three things to track weekly:
1. Your peak RPM and peak TPM on each model, with a 95th-percentile margin.
2. Your tier ceilings for those models. Set alerts when you're consistently within 70% of either.
3. Your `max_tokens` settings. Find the prompts that reserve significantly more output than they use, and tighten them.
Once you're approaching a ceiling, request a tier bump (OpenAI is generally responsive if you can show usage). And consider splitting workloads across multiple OpenAI organizations to multiply your effective ceiling.
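One simple shape for that weekly check; field names are illustrative, fed from your metrics store and the OpenAI dashboard:

```typescript
// Illustrative weekly headroom check against the 70% alert line.
interface ModelUsage {
  model: string;
  peakRpm: number;    // 95th-percentile weekly peak
  peakTpm: number;
  rpmCeiling: number; // from the OpenAI dashboard for your tier
  tpmCeiling: number;
}

function needsTierBump(u: ModelUsage): boolean {
  const rpmUtilization = u.peakRpm / u.rpmCeiling;
  const tpmUtilization = u.peakTpm / u.tpmCeiling;
  // Alert when consistently within 70% of either ceiling.
  return Math.max(rpmUtilization, tpmUtilization) > 0.7;
}
```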
Putting it all together
A production-grade OpenAI client looks roughly like this — shared bucket, classifier, backoff, capped attempts:
```typescript
import { SimpleQ } from "@simpleq/sdk";

const simpleq = new SimpleQ({ apiKey: process.env.SIMPLEQ_API_KEY });

// Configure once
await simpleq.rateLimits.set("openai-gpt-4o-mini-prod", {
  requestsPerSecond: 50,
  burst: 100
});

// Every call goes through the queue
export async function chat(input: string, userId: string) {
  return simpleq.enqueue({
    queue: "ai-jobs",
    type: "openai.chat",
    idempotencyKey: `chat_${userId}_${hash(input)}`,
    payload: {
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: input }],
      max_tokens: 512
    },
    retry: {
      maxAttempts: 6,
      backoff: "exponential",
      jitter: "full"
    },
    rateLimitKey: "openai-gpt-4o-mini-prod"
  });
}
```

That's the whole pattern. The queue handles retry/backoff/classifier; the rate-limit key handles capacity coordination across your fleet; the idempotency key prevents duplicate work on retries.
If you want this without building the bucket yourself, SimpleQ implements the full pattern — rate-limit keys, retry classifiers, dead-letter queues, and per-job observability. The free tier covers 10,000 executions a month. See the AI use case for an end-to-end example.