
Why every AI app needs a reliable execution layer — and what one looks like in production

LLM calls fail, hit rate limits, and time out — and inline calls are quietly the biggest reliability risk in modern AI apps. Here's how to fix it.

TL;DR

LLM calls fail more often than engineers used to clean HTTP 200s expect: timeouts, 429s, token-budget exhaustion, partial responses, provider outages. Calling them inline from request handlers turns every transient failure into a user-facing failure. An execution layer — queue, retry, rate-limit, observe — is the single highest-leverage piece of infrastructure for any AI product in production.

Most AI apps start the same way. You wire OpenAI into a request handler, ship it, and it works. Then traffic doubles, OpenAI ships a model update, your prompt grows, and one Tuesday afternoon your error rate quietly climbs from 0.3% to 4%. You start adding try-catches. You add a retry. You wonder why your latency p99 doubled. You add a circuit breaker. Six months later you've reimplemented half of a queueing system, badly, inside a request thread.

This post is about the abstraction you should reach for instead: an execution layer. We'll define it, show what it does and doesn't do, walk through the failure modes it eliminates, and look at the code that replaces ~600 lines of homegrown retry plumbing.

What is an execution layer?

An execution layer is a service that sits between your application and external APIs. Your code enqueues a job; the execution layer is responsible for running it reliably. Concretely, it provides four things:

  • Durable queueing. Once enqueued, a job is not lost, even if your app process dies, your deploy rolls forward, or a worker OOMs.
  • Retry with backoff. Transient failures (429, 5xx, network resets) retry automatically with exponential backoff and jitter, up to a configurable cap. Permanent failures land in a dead-letter queue you can inspect.
  • Rate-limit awareness. A shared token bucket per upstream (e.g. openai-prod) paces requests across all of your workers so you never exceed the provider's limit.
  • Observability. Every attempt is logged with its payload, response, latency, and error. You can grep, retry, replay, and alert on a single job's history.

You can think of it as "the part of your app that runs the unreliable code, separated from the part that responds to users." The synchronous part stays fast and predictable; the asynchronous part absorbs the chaos of the outside world.
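To make the retry bullet concrete, here is a minimal sketch of exponential backoff with full jitter. The base delay and cap are illustrative values, not SimpleQ defaults.

// Exponential backoff with full jitter: the window doubles per attempt
// (capped), and the actual delay is drawn uniformly from it so that
// independent workers don't retry in lockstep.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  const windowMs = Math.min(capMs, baseMs * 2 ** attempt); // attempt 0: up to 500ms, 1: up to 1s, 2: up to 2s, ...
  return Math.random() * windowMs;
}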

What goes wrong when you call LLMs inline

Before defending the pattern, it's worth being specific about the failure modes inline LLM calls produce in production. These are not edge cases — they happen to every AI product at scale.

1. Timeouts cascade

LLM responses can take 30+ seconds for complex prompts, and longer still with reasoning models. If your web server's request timeout is 30 seconds (and your load balancer's is 60), a slow OpenAI response becomes a user-visible error long before the model finishes. Worse, the upstream call keeps consuming tokens you pay for long after the user has given up.
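Here's a minimal sketch of the inline pattern with a hard client-side cap, using plain fetch against OpenAI's chat completions endpoint. The function name, model choice, and 30-second figure are illustrative.

// Inline call with a hard 30-second cap. If the model takes longer, fetch
// rejects with a timeout error and the user sees a failure, but the provider
// keeps generating (and charging for) the completion on its side.
async function summarizeInline(prompt: string) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    }),
    signal: AbortSignal.timeout(30_000),
  });
  if (!res.ok) throw new Error(`OpenAI returned ${res.status}`);
  return res.json();
}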

2. Rate limits arrive in bursts

OpenAI enforces tokens-per-minute (TPM) and requests-per-minute (RPM) limits per organization, per key. When you hit them, you get a 429 with a Retry-After header. If three of your workers each retry independently, you've now multiplied the load — and the 429s — by three. Without a shared bucket, every retry storm is self-inflicted.
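The shared bucket can be sketched in a few lines. The version below is in-process and purely illustrative; in production the bucket state would live somewhere every worker can see (Redis, or the execution layer itself), not in one process's memory.

// A token bucket keyed by upstream (e.g. "openai-prod"): refill at a steady
// rate, spend one token per request, and tell the caller how long to wait
// when the bucket is empty.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  // Returns 0 if the request may proceed now, otherwise the ms to wait.
  take(): number {
    const now = Date.now();
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.lastRefill) / 1000) * this.refillPerSecond
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return 0;
    }
    return ((1 - this.tokens) / this.refillPerSecond) * 1000;
  }
}

// e.g. roughly 500 requests per minute shared by everything drawing from this bucket:
// const openaiProd = new TokenBucket(500, 500 / 60);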

3. Deploys lose in-flight work

When a request thread is mid-OpenAI-call and your deploy starts draining, you have two options: kill the request (a user-facing error) or wait out the full timeout window (a slow rollout). Neither is good. Durable queues sidestep the problem entirely: the job survives, and a worker picks it up after the deploy.

4. Cost blowups from naive retries

A retry on an LLM call costs real money. If your retry policy is 'retry on any error, up to 5 times,' a flaky network hiccup can 5x your spend on a single prompt. Without idempotency keys and a careful retry classifier, retries also create duplicate downstream effects — duplicate database writes, duplicate notifications, duplicate billing events.
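A retry classifier doesn't have to be elaborate; the important part is that permanent failures short-circuit before they cost anything more. Here is a sketch with the usual suspect status codes (adjust for your provider's actual error shapes); pair it with the idempotency key shown in the enqueue example later so that when a retry does happen, downstream writes collapse into one.

// Only pay for another attempt when the failure was plausibly transient.
function isRetryable(status: number | undefined): boolean {
  if (status === undefined) return true; // network reset, no response at all
  if (status === 429) return true;       // rate limited: back off and try again
  if (status >= 500) return true;        // provider-side fault
  return false;                          // 400/401/404 etc.: retrying just multiplies the bill
}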

5. Observability gaps

Inline calls leave breadcrumbs in your app logs, but they're scattered. To answer 'how did request X actually go?', you have to correlate by trace ID across services. An execution layer gives you a single record per job with every attempt, every payload, every response — and a one-click retry button.
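The "single record per job" might look something like the shape below; the field names are hypothetical, not SimpleQ's actual schema.

// Illustrative shape of a per-job record: one row per job, one entry per attempt.
interface JobRecord {
  id: string;
  type: string; // e.g. "openai.chat"
  status: "queued" | "running" | "succeeded" | "dead_lettered";
  payload: unknown;
  attempts: Array<{
    startedAt: string;
    latencyMs: number;
    error?: string;           // e.g. "429 Too Many Requests"
    responsePreview?: string; // truncated response body for debugging
  }>;
}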

The execution layer pattern: enqueue and acknowledge

Once you adopt an execution layer, your request handler stops being responsible for the LLM call. It does two things: validate the input, then enqueue a job and return. Something like:

server/routes/summarize.ts
import { SimpleQ } from "@simpleq/sdk";

const simpleq = new SimpleQ({ apiKey: process.env.SIMPLEQ_API_KEY });

export async function POST(req: Request) {
  const { text, userId } = await req.json();

  const job = await simpleq.enqueue({
    queue: "ai-jobs",
    type: "openai.chat",
    idempotencyKey: `summary_${userId}_${hash(text)}`,
    payload: {
      model: "gpt-4o-mini",
      messages: [
        { role: "system", content: "Summarize concisely." },
        { role: "user", content: text }
      ]
    },
    retry: { maxAttempts: 5, backoff: "exponential" },
    rateLimitKey: "openai-prod"
  });

  return Response.json({ jobId: job.id });
}

The client polls or subscribes for completion. When the job succeeds, your worker callback runs — write the result to the database, notify the user, kick off the next step. When it fails after all retries, it lands in the dead-letter queue and you get an alert.
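The worker side stays small too. Here's a sketch of a handler for the job enqueued above; the handler registration, the ChatJob shape, and saveSummary are assumptions standing in for your own worker runtime and persistence layer, while the OpenAI call itself uses the official Node SDK.

import OpenAI from "openai";
import type { ChatCompletionMessageParam } from "openai/resources/chat/completions";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Assumed job shape: mirrors the payload enqueued in the route handler above.
type ChatJob = {
  id: string;
  payload: { model: string; messages: ChatCompletionMessageParam[] };
};

// Placeholder for your own persistence code.
declare function saveSummary(jobId: string, summary: string): Promise<void>;

export async function handleOpenAIChat(job: ChatJob) {
  const completion = await openai.chat.completions.create({
    model: job.payload.model,
    messages: job.payload.messages,
  });

  const summary = completion.choices[0]?.message?.content ?? "";

  // Key the write by job id so a retried or redelivered job overwrites
  // its own result instead of creating a duplicate.
  await saveSummary(job.id, summary);
}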

Compare that to the inline equivalent, which has to handle the OpenAI SDK call, retry logic, rate limiting, timeouts, logging, and database writes — all while holding the request open. The inline version is also where every bug lives.

Streaming and the execution layer

For interactive chat, you still want streaming. The execution layer doesn't replace your streaming endpoint — it sits behind it. Use it for anything that doesn't need to stream: summaries, embeddings, agents, evaluations, batch transforms, scheduled work. For streaming endpoints, share the same rate-limit bucket so streamed traffic and queued traffic don't fight for capacity.
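A hypothetical sketch of that sharing: the streaming route waits on the same "openai-prod" key before opening the stream, so interactive traffic is paced against the same limit as the queued jobs. acquireToken stands in for whatever backs the shared bucket (Redis, or the execution layer's own rate-limit API); the streaming call uses the official OpenAI Node SDK.

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Assumed helper: resolves once the shared "openai-prod" bucket has capacity.
declare function acquireToken(key: string): Promise<void>;

export async function POST(req: Request) {
  const { messages } = await req.json();

  await acquireToken("openai-prod"); // same bucket the queued jobs draw from

  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages,
    stream: true,
  });

  // Forward deltas to the client as they arrive.
  const encoder = new TextEncoder();
  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const chunk of stream) {
        controller.enqueue(encoder.encode(chunk.choices[0]?.delta?.content ?? ""));
      }
      controller.close();
    },
  });

  return new Response(body, { headers: { "Content-Type": "text/plain; charset=utf-8" } });
}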

What good looks like

A few signals that your execution layer is doing its job:

  • Your p99 latency on user-facing endpoints is independent of OpenAI's latency.
  • When OpenAI has a partial outage, your error rate is a flat line and your queue depth temporarily climbs — instead of your error rate climbing and your support inbox filling up.
  • You can answer 'what happened on this user's request?' in under 30 seconds by clicking through to the job in your dashboard.
  • Retries are silent. The user doesn't see them, your alerting doesn't fire on them, and your bill reflects them.
  • Adding a new AI feature doesn't mean reimplementing retry logic.

What it isn't

An execution layer is not a workflow engine. It runs single jobs reliably; it does not maintain durable state across many steps with timers and signals (that's Temporal's territory). It is also not a vector database, a prompt evaluator, or a model router. It's a primitive — and like most primitives, you compose it with other things.

It's also not a substitute for thinking about prompts, model selection, evaluation, or cost. Reliable execution makes everything else easier to build — but you still have to build them.

Getting started

If you have an AI app in production and you're calling LLMs inline, the highest-leverage thing you can do this quarter is move those calls behind an execution layer. The migration is usually small: one endpoint enqueues, one worker callback runs the actual call, and your retry/rate-limit/observability story goes from 'TODO' to 'done' in a day.

If you want to skip the bespoke part, that's exactly what SimpleQ is for. The free tier covers 10,000 executions a month — enough to migrate a real AI feature and see it working before you commit. Read the Quickstart or look at the AI job processing use case for the end-to-end pattern.

Frequently asked questions

What is an execution layer?
An execution layer is a service that sits between your application and external AI APIs (like OpenAI, Anthropic, or Replicate). It queues calls, retries failures, enforces rate limits, and gives you per-job logs — so a single 429 from OpenAI doesn't turn into a 500 from your app.