Building async infrastructure without overengineering

When to reach for Redis, when to reach for a hosted queue, and when to skip both. A guide to staying simple as your traffic grows.

TL;DR

Most teams overbuild async infrastructure too early — Kafka before they have streams, Temporal before they have workflows, custom queues before they have queue load. The right move is to start simple and graduate when you hit specific signals, not based on what other teams use. This post walks through the stages and the signals.

Async infrastructure has a strange reputation for being either trivial (it's just a list, right?) or terrifying (we'd better adopt Temporal). The truth is that most teams need something in the middle, and the path between stages is well-trodden. The mistake isn't choosing the wrong tool — it's choosing two stages ahead of where you are.

This post is the progression we recommend to teams. It's deliberately uncomfortable, because at each stage we'll tell you to be simpler than your peers, and to graduate only when something specific breaks.

Stage 0: do it inline

Before you have async infrastructure, you have a function that calls the thing. That's fine for a long time. The decision to add async is governed by one question: does the user have to wait for this to finish?

If the user is waiting for an email confirmation that takes 200ms, just send it inline. If they're waiting on a 30-second AI summary, that's the first signal — async is the right call.
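As a concrete baseline, here's what the inline version looks like, sketched with Express; createUser and sendConfirmationEmail are illustrative stubs, not a prescribed API:

```ts
import express from "express";

// Illustrative stubs: real versions would hit your database and email provider.
async function createUser(body: { email: string }): Promise<{ id: string; email: string }> {
  return { id: "u_123", email: body.email };
}
async function sendConfirmationEmail(to: string): Promise<void> {
  // ~200ms round trip to an email API: cheap enough to await inline.
}

const app = express();
app.use(express.json());

app.post("/signup", async (req, res) => {
  const user = await createUser(req.body);
  await sendConfirmationEmail(user.email); // the user waits 200ms, not 30s
  res.status(201).json({ id: user.id });
});

app.listen(3000);
```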

Inline is undervalued

Engineers love async because it 'feels right.' But inline code is observable, debuggable, and one stack trace away from being understood. Don't move work to async because it's modern — move it because the latency or failure profile demands it.

Stage 1: setTimeout / setImmediate (don't)

Every team writes this version at least once: 'oh I'll just setTimeout the slow part.' The slow part runs in the same process, after the response is sent. It works perfectly until your process crashes, restarts, or scales horizontally — at which point you discover you've invented a fire-and-forget queue with zero durability.
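For recognition purposes, the anti-pattern looks like this (reusing the illustrative handler and stubs from the stage 0 sketch):

```ts
app.post("/signup", async (req, res) => {
  const user = await createUser(req.body);
  res.status(201).json({ id: user.id }); // respond immediately...

  // ...then fire-and-forget the slow part in the same process.
  setTimeout(() => {
    sendConfirmationEmail(user.email).catch(() => {
      // If the process crashes, restarts, or gets rescheduled before this
      // runs, the email is silently lost: no retry, no dead-letter, no record.
    });
  }, 0);
});
```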

Skip this stage. The moment you're tempted to reach for setTimeout, reach for a durable queue instead. The leap is small, and it saves you from losing a weekend to a silently dropped email three months from now.

Stage 2: hosted queue with retries and scheduling

This is where most teams should live for the longest. A managed queue with the core primitives — enqueue, retry with backoff, scheduled jobs, rate limits, dead-letter, observability — replaces 80% of what bespoke async infrastructure does. You write workers; the service runs them.
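The day-to-day surface is small. Here's a sketch against a hypothetical managed-queue client; the package name, methods, and options are illustrative placeholders, not any real SDK's API:

```ts
import { Queue } from "@example/queue-client"; // hypothetical package

const summaries = new Queue("ai-summaries");

// Enqueue from the request path: fast, durable, survives restarts.
await summaries.enqueue(
  { documentId: "doc_42" },
  {
    retries: 5,                           // retry with exponential backoff
    backoff: "exponential",
    runAt: new Date(Date.now() + 60_000), // or schedule for later
  }
);

// Elsewhere: a worker function the service invokes with each job.
export async function summarize(job: { documentId: string }) {
  // the 30-second AI call runs here, off the request path
}
```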

The cost is roughly 'a few thousand a year for what would otherwise be a few hundred hours of engineering plus an ops surface.' For most teams that math is a no-brainer. (This is, of course, what SimpleQ is.)

Signals that you've outgrown stage 2 are rare, and most teams never hit them. They look like:

  • You need multi-step coordination with durable state across steps (workflow engine territory).
  • You need event-streaming fan-out to many consumers (Kafka territory).
  • You have regulatory replay requirements (workflow engine territory).
  • Your per-job cost makes managed pricing untenable at your scale (rare; usually 10M+ jobs/month).

If none of those apply, stay at stage 2 until they do.

Stage 3a: self-hosted BullMQ / Sidekiq / Resque

Some teams reach for self-hosted queue libraries on top of Redis. BullMQ (Node), Sidekiq (Ruby), and Resque (Ruby) are battle-tested and free. They give you queueing primitives and let you operate your own dashboards, workers, and Redis cluster.
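A minimal BullMQ sketch, assuming a local Redis; the email helper is an illustrative stub:

```ts
import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 }; // your Redis

async function sendConfirmationEmail(to: string) {
  // illustrative stub: call your email provider here
}

const emails = new Queue("emails", { connection });

// Enqueue with retries and exponential backoff.
await emails.add(
  "confirmation",
  { to: "user@example.com" },
  { attempts: 5, backoff: { type: "exponential", delay: 1000 } }
);

// You run and scale the worker process (and the dashboard, and Redis) yourself.
new Worker(
  "emails",
  async (job) => {
    await sendConfirmationEmail(job.data.to);
  },
  { connection }
);
```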

The honest trade is: you save the queue service's bill, you pay in operations. You debug Redis OOMs, you build your own retry classifiers, you write your own rate-limit primitives, you compose your own scheduler. None of these is hard in isolation; collectively they're a part-time job.

Self-host when one of the following is true:

  • You already operate Redis well, and the queue is a small marginal addition.
  • Compliance or data-residency requirements rule out a managed service.
  • Your scale makes managed pricing genuinely uneconomical (uncommon).

Stage 3b: Postgres as a queue

SELECT ... FOR UPDATE SKIP LOCKED is a real pattern and works fine at low volumes. You already have Postgres; one more table is cheap. For under 10,000 jobs/day with simple retry needs, this is actually a defensible choice.
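Here's a sketch of the claim query using node-postgres; the jobs table schema (status, run_at, started_at, payload) is assumed for illustration:

```ts
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from PG* env vars

// Claim one due job atomically. SKIP LOCKED lets concurrent workers each
// grab a different row instead of blocking on (or double-claiming) the same one.
async function claimJob() {
  const { rows } = await pool.query(`
    UPDATE jobs
    SET status = 'running', started_at = now()
    WHERE id = (
      SELECT id FROM jobs
      WHERE status = 'pending' AND run_at <= now()
      ORDER BY run_at
      LIMIT 1
      FOR UPDATE SKIP LOCKED
    )
    RETURNING id, payload
  `);
  return rows[0] ?? null; // null when nothing is due
}
```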

It stops working when:

  • Write amplification from re-enqueues fights with your primary workload for I/O.
  • Autovacuum can't keep up with row churn on the jobs table.
  • You need queue-specific features (rate limits, priorities) that require schema gymnastics.
  • Your queue's read load competes with your main app's read load.

Graduate before it hurts, not after. Postgres-as-a-queue is a good intermediate; it's not a destination.

Stage 4: workflow engine (only if you need it)

Temporal, Cadence, Inngest, and Trigger.dev are workflow engines: queues plus durable state plus replay. They're the right answer for multi-step coordination problems. They're the wrong answer for everything else.

See Queue vs workflow engine for the decision framework. Short version: if you can't describe a specific multi-step coordination problem your current stack can't handle, you don't need stage 4.

Stage 5: event bus (Kafka / Kinesis / Pulsar)

Kafka is excellent at what it does — high-throughput event streaming with replay, multiple independent consumers, and ordered partitions. It is not a job queue. It has no concept of per-job retries, dead-letters, or rate limits. Building those on top is a serious investment.

Reach for Kafka when you have streaming use cases (analytics pipelines, change data capture, large fan-out broadcasts). Don't reach for it because you have 'lots of events.' Most teams have lots of events that are actually jobs.

The anti-pattern: choosing two stages ahead

The most common failure mode is choosing tooling two stages ahead of your actual need. Specifically:

  • Adopting Kafka because you'll 'eventually need it' — and then using it as a job queue.
  • Adopting Temporal because workflow engines feel more enterprise — and then writing every workflow as a single activity.
  • Building a custom queue on Postgres because 'we already have Postgres' — and reimplementing every primitive a managed queue gives you for free.

The cost of being two stages ahead isn't just dollars; it's onboarding time, ops surface, and the friction of building on a primitive whose capabilities you don't yet use. Be one stage ahead at most, and graduate deliberately.

Summary

| Stage | Tool | When to graduate |
| --- | --- | --- |
| 0 | Inline function call | Latency or failure profile demands async |
| 1 | setTimeout (skip!) | Always: go straight to stage 2 |
| 2 | Managed queue | Multi-step durable state, streaming, or regulation |
| 3a | Self-hosted BullMQ/Sidekiq | Same as stage 2, or back to stage 2 |
| 3b | Postgres-as-queue | ~10k jobs/day or autovacuum pressure |
| 4 | Workflow engine | You can name the specific coordination need |
| 5 | Event bus | Streaming, fan-out, replay (not jobs) |

Most teams stay at stage 2 for years and never need anything else. The teams that graduate earliest aren't ahead — they're just paying for capability they don't yet use. Be deliberate about the cost of each step.

If you're at stage 0 or 1 today, SimpleQ is the no-regret stage 2 choice. Free tier covers 10,000 executions/month; see the docs for the Quickstart or the use cases for end-to-end patterns.

Frequently asked questions

What is async infrastructure?

Async infrastructure is the set of systems that run work outside of a user request — background jobs, scheduled tasks, retries, webhooks, AI calls, and bulk operations. It typically includes a queue, workers, a scheduler, and observability.