As AI-based applications become more sophisticated, managing their asynchronous tasks becomes increasingly complex. Whether you’re generating content, processing embeddings, or chaining together multiple model calls—queues are essential infrastructure.
And for many Node.js applications, BullMQ has become the go-to queueing library.
In this post, we’ll walk through why BullMQ fits well into AI pipelines, and how to handle some of the pitfalls that come with running critical async work at scale.
**Why BullMQ Makes Sense for AI Workflows**
AI jobs are often:
- CPU/GPU-intensive (model inference)
- Long-running (fine-tuning, summarizing large chunks of text)
- Chainable (one output feeds the next)
- Best handled asynchronously
Queues help break down these processes into manageable, distributed units.
**Example: A Simple AI Pipeline with BullMQ**
Let’s say you’re building a summarization service.
1. A user submits a document.
2. The job is queued.
3. A worker generates the summary.
4. A follow-up task sends it via email.
Here’s how you might structure that with BullMQ:
```typescript
// queues.ts
import { Queue } from 'bullmq';
import { connection } from './redis-conn';

export const summarizationQueue = new Queue('summarize', { connection });
export const emailQueue = new Queue('email', { connection });
```
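The `./redis-conn` module isn't shown in this post, so here is one plausible shape for it (an assumption, not the author's actual code), using ioredis, the client BullMQ builds on. One detail worth knowing: BullMQ requires `maxRetriesPerRequest: null` on the connection it uses.

```typescript
// redis-conn.ts — hypothetical module; host/port are placeholders
import IORedis from 'ioredis';

// BullMQ requires maxRetriesPerRequest: null so its blocking
// Redis commands are not interrupted by ioredis retry logic.
export const connection = new IORedis({
  host: process.env.REDIS_HOST ?? '127.0.0.1',
  port: Number(process.env.REDIS_PORT ?? 6379),
  maxRetriesPerRequest: null,
});
```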
```typescript
// producer.ts
import { summarizationQueue } from './queues';

// Include userId so the worker can route the follow-up email.
await summarizationQueue.add('summarizeDoc', {
  docId: 'abc123',
  userId: 'user-42',
});
```
```typescript
// summarization.worker.ts
import { Worker } from 'bullmq';
import { emailQueue } from './queues';
import { connection } from './redis-conn';

new Worker('summarize', async job => {
  // generateSummary is your model-calling code, defined elsewhere.
  const summary = await generateSummary(job.data.docId);
  await emailQueue.add('sendEmail', {
    userId: job.data.userId,
    summary,
  });
}, { connection });
```
You can imagine how this might expand:
- Queue for transcription
- Queue for sentiment analysis
- Queue for search index updates
**What to Watch Out For**
When you're handling large numbers of AI jobs:
- Memory usage spikes can crash your Redis instance.
- Worker failures can leave queues silently stuck.
- Job retries without proper limits can pile up fast.
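One cheap defense against silent worker failures is attaching listeners to each worker; `failed` and `error` are real BullMQ worker events. The sketch below uses a minimal structural type instead of importing BullMQ's `Worker` class, so the helpers stay loadable without Redis (names like `attachObservers` are made up for illustration):

```typescript
// Minimal structural type covering what we need from a BullMQ Worker.
type ObservableWorker = {
  on(event: string, handler: (...args: any[]) => void): void;
};

// Pure helper: one log line per failure, easy to unit test.
export function describeFailure(queue: string, jobId: string | undefined, err: Error): string {
  return `[${queue}] job ${jobId ?? 'unknown'} failed: ${err.message}`;
}

export function attachObservers(worker: ObservableWorker, queueName: string): void {
  // 'failed' fires when a job's processor throws.
  worker.on('failed', (job: { id?: string } | undefined, err: Error) => {
    console.error(describeFailure(queueName, job?.id, err));
  });
  // 'error' fires for worker/connection-level problems; without a
  // listener these can pass unnoticed and leave the queue stuck.
  worker.on('error', (err: Error) => {
    console.error(`[${queueName}] worker error:`, err);
  });
}
```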
These are hard to track without some sort of observability layer.
**Good Practices for AI Queue Systems**
✅ Use the `removeOnComplete: true` job option to avoid memory buildup
✅ Set `attempts` and `backoff` on your long-running jobs
✅ Monitor failed jobs and queue lengths
✅ Alert on missing workers or high backlog
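The first two checklist items map directly onto BullMQ's job options. A sketch with illustrative values only (tune them for your workload):

```typescript
// Default options implementing the checklist above.
export const defaultJobOptions = {
  attempts: 3,                                    // retry failed jobs up to 3 times
  backoff: { type: 'exponential', delay: 5000 },  // ~5s, 10s, 20s between tries
  removeOnComplete: true,                         // drop finished jobs from Redis
  removeOnFail: 1000,                             // keep last 1000 failures for debugging
};

// Hypothetical usage with the summarization queue from earlier:
// await summarizationQueue.add('summarizeDoc', { docId }, defaultJobOptions);
```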
Even a minimal dashboard that shows which queues are stuck or which workers are down can save hours.
We had to build one ourselves. If you're looking for something simple and focused, we put together a tool called Upqueue.io that visualizes BullMQ jobs and alerts you when things go wrong. But whether it's a custom script, Prometheus, or something else, just make sure you're not flying blind.
BullMQ is a great fit for AI apps. But the more you scale, the more you need to see what’s going on.
Don’t let your GPT worker crash at 3am without you knowing.
Monitor early. Sleep better.
**Top comments**
Great read, thanks for sharing! I've been using BullMQ for some time and it's been super reliable for chaining jobs and handling heavy tasks. The observability part really resonated—definitely had moments where things failed silently. Upqueue sounds interesting, might check it out. Anyone here tried it in a real project?
Thanks for the kind words! 🙏
Yeah, BullMQ itself is great at what it does, but the lack of built-in visibility definitely caught us off guard in production. That’s actually what led me to build Upqueue - we wanted something simple that shows if a queue gets stuck or a worker silently dies before users start yelling 😅
Still early days but it’s already helped us avoid a few scary incidents. If you do give it a spin, I’d love to hear how it holds up in your setup!
been cool seeing steady progress in stuff like this - always makes me think if it’s just the tooling or if habits around monitoring really matter more long-term?
Totally get you @nevodavid - I’ve come to believe it’s both.
Even the best tooling won’t help if you’re not in the habit of watching what matters or reacting early. But good tooling can nudge you into better habits. For example, once we got real-time alerts for queue delays, we started building playbooks for edge cases we used to ignore 😅
So yeah, tooling helps a lot, but the real win is when it quietly encourages you to care earlier.
It's all a matter of standards (and impact TBH).
Really excited about BullMQ after reading this article. Would like to get hands on this. Also, reading about Upqueue, would like to look deeper into that too