Building async infrastructure without overengineering

When to reach for Redis, when to reach for a hosted queue, and when to skip both. A guide to staying simple as your traffic grows.

TL;DR

Most teams overbuild async infrastructure too early — Kafka before they have streams, Temporal before they have workflows, custom queues before they have queue load. The right move is to start simple and graduate when you hit specific signals, not based on what other teams use. This post walks through the stages and the signals.

Async infrastructure has a strange reputation for being either trivial (it's just a list, right?) or terrifying (we'd better adopt Temporal). The truth is that most teams need something in the middle, and the path between stages is well-trodden. The mistake isn't choosing the wrong tool — it's choosing two stages ahead of where you are.

This post is the progression we recommend to teams. It's deliberately uncomfortable, because at each stage we'll tell you to be simpler than your peers, and to graduate only when something specific breaks.

Stage 0: do it inline

Before you have async infrastructure, you have a function that calls the thing. That's fine for a long time. The decision to add async is governed by one question: does the user have to wait for this to finish?

If the user is waiting for an email confirmation that takes 200ms, just send it inline. If they're waiting on a 30-second AI summary, that's the first signal — async is the right call.
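As a concrete baseline, here's what the inline version looks like, sketched with Express; createUser and sendConfirmationEmail are illustrative stubs, not a prescribed API:

```ts
import express from "express";

// Illustrative stubs: real versions would hit your database and email provider.
async function createUser(body: { email: string }): Promise<{ id: string; email: string }> {
  return { id: "u_123", email: body.email };
}
async function sendConfirmationEmail(to: string): Promise<void> {
  // ~200ms round trip to an email API: cheap enough to await inline.
}

const app = express();
app.use(express.json());

app.post("/signup", async (req, res) => {
  const user = await createUser(req.body);
  await sendConfirmationEmail(user.email); // the user waits 200ms, not 30s
  res.status(201).json({ id: user.id });
});

app.listen(3000);
```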

Inline is undervalued

Engineers love async because it 'feels right.' But inline code is observable, debuggable, and one stack trace away from being understood. Don't move work to async because it's modern — move it because the latency or failure profile demands it.

Stage 1: setTimeout / setImmediate (don't)

Every team writes this version at least once: 'oh I'll just setTimeout the slow part.' The slow part runs in the same process, after the response is sent. It works perfectly until your process crashes, restarts, or scales horizontally — at which point you discover you've invented a fire-and-forget queue with zero durability.
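For recognition purposes, the anti-pattern looks like this (reusing the illustrative handler and stubs from the stage 0 sketch):

```ts
app.post("/signup", async (req, res) => {
  const user = await createUser(req.body);
  res.status(201).json({ id: user.id }); // respond immediately...

  // ...then fire-and-forget the slow part in the same process.
  setTimeout(() => {
    sendConfirmationEmail(user.email).catch(() => {
      // If the process crashes, restarts, or gets rescheduled before this
      // runs, the email is silently lost: no retry, no dead-letter, no record.
    });
  }, 0);
});
```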

Skip this stage. The moment you're tempted to reach for setTimeout, reach for a durable queue instead. The leap is small, and it saves you from losing a weekend to a silently dropped email three months from now.

Stage 2: hosted queue with retries and scheduling

This is where most teams should live for the longest. A managed queue with the core primitives — enqueue, retry with backoff, scheduled jobs, rate limits, dead-letter, observability — replaces 80% of what bespoke async infrastructure does. You write workers; the service runs them.
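The day-to-day surface is small. Here's a sketch against a hypothetical managed-queue client; the package name, methods, and options are illustrative placeholders, not any real SDK's API:

```ts
import { Queue } from "@example/queue-client"; // hypothetical package

const summaries = new Queue("ai-summaries");

// Enqueue from the request path: fast, durable, survives restarts.
await summaries.enqueue(
  { documentId: "doc_42" },
  {
    retries: 5,                           // retry with exponential backoff
    backoff: "exponential",
    runAt: new Date(Date.now() + 60_000), // or schedule for later
  }
);

// Elsewhere: a worker function the service invokes with each job.
export async function summarize(job: { documentId: string }) {
  // the 30-second AI call runs here, off the request path
}
```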

The cost is roughly 'a few thousand a year for what would otherwise be a few hundred hours of engineering plus an ops surface.' For most teams that math is a no-brainer. (This is, of course, what SimpleQ is.)

Signals that you've outgrown stage 2 are rare, and most teams never hit them. They look like:

  • You need multi-step coordination with durable state across steps (workflow engine territory).
  • You need event-streaming fan-out to many consumers (Kafka territory).
  • You have regulatory replay requirements (workflow engine territory).
  • Your per-job cost makes managed pricing untenable at your scale (rare; usually 10M+ jobs/month).

If none of those apply, stay at stage 2 until they do.

Stage 3a: self-hosted BullMQ / Sidekiq / Resque

Some teams reach for self-hosted queue libraries on top of Redis. BullMQ (Node), Sidekiq (Ruby), and Resque (Ruby) are battle-tested and free. They give you queueing primitives and let you operate your own dashboards, workers, and Redis cluster.
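A minimal BullMQ sketch, assuming a local Redis; the email helper is an illustrative stub:

```ts
import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 }; // your Redis

async function sendConfirmationEmail(to: string) {
  // illustrative stub: call your email provider here
}

const emails = new Queue("emails", { connection });

// Enqueue with retries and exponential backoff.
await emails.add(
  "confirmation",
  { to: "user@example.com" },
  { attempts: 5, backoff: { type: "exponential", delay: 1000 } }
);

// You run and scale the worker process (and the dashboard, and Redis) yourself.
new Worker(
  "emails",
  async (job) => {
    await sendConfirmationEmail(job.data.to);
  },
  { connection }
);
```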

The honest trade is: you save the queue service's bill, you pay in operations. You debug Redis OOMs, you build your own retry classifiers, you write your own rate-limit primitives, you compose your own scheduler. None of these is hard in isolation; collectively they're a part-time job.

Self-host when one of the following is true:

  • You already operate Redis well, and the queue is a small marginal addition.
  • Compliance or data-residency requirements rule out a managed service.
  • Your scale makes managed pricing genuinely uneconomical (uncommon).

Stage 3b: Postgres as a queue

SELECT ... FOR UPDATE SKIP LOCKED is a real pattern and works fine at low volumes. You already have Postgres; one more table is cheap. For under 10,000 jobs/day with simple retry needs, this is actually a defensible choice.
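Here's a sketch of the claim query using node-postgres; the jobs table schema (status, run_at, started_at, payload) is assumed for illustration:

```ts
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from PG* env vars

// Claim one due job atomically. SKIP LOCKED lets concurrent workers each
// grab a different row instead of blocking on (or double-claiming) the same one.
async function claimJob() {
  const { rows } = await pool.query(`
    UPDATE jobs
    SET status = 'running', started_at = now()
    WHERE id = (
      SELECT id FROM jobs
      WHERE status = 'pending' AND run_at <= now()
      ORDER BY run_at
      LIMIT 1
      FOR UPDATE SKIP LOCKED
    )
    RETURNING id, payload
  `);
  return rows[0] ?? null; // null when nothing is due
}
```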

It stops working when:

  • Write amplification from re-enqueues fights with your primary workload for I/O.
  • Autovacuum can't keep up with row churn on the jobs table.
  • You need queue-specific features (rate limits, priorities) that require schema gymnastics.
  • Your queue's read load competes with your main app's read load.

Graduate before it hurts, not after. Postgres-as-a-queue is a good intermediate; it's not a destination.

Stage 4: workflow engine (only if you need it)

Temporal, Cadence, Inngest, and Trigger.dev are workflow engines: queues plus durable state plus replay. They're the right answer for multi-step coordination problems. They're the wrong answer for everything else.

See Queue vs workflow engine for the decision framework. Short version: if you can't describe a specific multi-step coordination problem your current stack can't handle, you don't need stage 4.

Stage 5: event bus (Kafka / Kinesis / Pulsar)

Kafka is excellent at what it does — high-throughput event streaming with replay, multiple independent consumers, and ordered partitions. It is not a job queue. It has no concept of per-job retries, dead-letters, or rate limits. Building those on top is a serious investment.

Reach for Kafka when you have streaming use cases (analytics pipelines, change data capture, large fan-out broadcasts). Don't reach for it because you have 'lots of events.' Most teams have lots of events that are actually jobs.

The anti-pattern: choosing two stages ahead

The most common failure mode is choosing tooling two stages ahead of your actual need. Specifically:

  • Adopting Kafka because you'll 'eventually need it' — and then using it as a job queue.
  • Adopting Temporal because workflow engines feel more enterprise — and then writing every workflow as a single activity.
  • Building a custom queue on Postgres because 'we already have Postgres' — and reimplementing every primitive a managed queue gives you for free.

The cost of being two stages ahead isn't just dollars; it's onboarding time, ops surface, and the friction of building on a primitive whose capabilities you don't yet use. Be one stage ahead at most, and graduate deliberately.

Summary

| Stage | Tool | When to graduate |
| --- | --- | --- |
| 0 | Inline function call | Latency or failure profile demands async |
| 1 | setTimeout (skip!) | Always: go straight to stage 2 |
| 2 | Managed queue | Multi-step durable state, streaming, or regulation |
| 3a | Self-hosted BullMQ/Sidekiq | Same as stage 2, or back to stage 2 |
| 3b | Postgres-as-queue | ~10k jobs/day or autovacuum pressure |
| 4 | Workflow engine | You can name the specific coordination need |
| 5 | Event bus | Streaming, fan-out, replay (not jobs) |

Most teams stay at stage 2 for years and never need anything else. The teams that graduate earliest aren't ahead — they're just paying for capability they don't yet use. Be deliberate about the cost of each step.

If you're at stage 0 or 1 today, SimpleQ is the no-regret stage 2 choice. Free tier covers 10,000 executions/month; see the docs for the Quickstart or the use cases for end-to-end patterns.

Frequently asked questions

What is async infrastructure?

Async infrastructure is the set of systems that run work outside of a user request — background jobs, scheduled tasks, retries, webhooks, AI calls, and bulk operations. It typically includes a queue, workers, a scheduler, and observability.