<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Behnam Amiri</title>
    <description>The latest articles on Forem by Behnam Amiri (@behnamaxo).</description>
    <link>https://forem.com/behnamaxo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3799265%2F3a021a91-d9bd-4192-a22e-49a6c020db5e.jpg</url>
      <title>Forem: Behnam Amiri</title>
      <link>https://forem.com/behnamaxo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/behnamaxo"/>
    <language>en</language>
    <item>
      <title>No, DriftQ Is Not Trying to Be Temporal</title>
      <dc:creator>Behnam Amiri</dc:creator>
      <pubDate>Thu, 12 Mar 2026 01:05:27 +0000</pubDate>
      <link>https://forem.com/behnamaxo/no-driftq-is-not-trying-to-be-temporal-2cl1</link>
      <guid>https://forem.com/behnamaxo/no-driftq-is-not-trying-to-be-temporal-2cl1</guid>
      <description>&lt;p&gt;I get this question a lot. Somebody sees DriftQ-Core for the first time, reads "durable broker" and "replayable workflows," and the first thing they ask is: "so how is this different from Temporal?"&lt;/p&gt;

&lt;p&gt;Totally fair question. Here's the honest answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Temporal is incredible. Use it if you need it.
&lt;/h2&gt;

&lt;p&gt;I'm not going to sit here and pretend DriftQ is some kind of competitor to Temporal. Temporal has been worked on for over a decade. It grew out of Uber's Cadence project. Thousands of companies run it in production. Stripe, Netflix, Datadog, real stuff at real scale. The team behind it has basically been thinking about durable execution longer than most of us have been thinking about distributed systems at all.&lt;/p&gt;

&lt;p&gt;If you need a full orchestration platform with deterministic workflow replay, multi-service scaling, rich message passing (signals, queries, updates), official SDKs in seven languages, namespace-based multi-tenancy, and a managed cloud option, that's Temporal. Go use it. Seriously.&lt;/p&gt;

&lt;p&gt;DriftQ is not that. And it's not trying to be.&lt;/p&gt;

&lt;h2&gt;
  
  
  So what is DriftQ actually trying to do?
&lt;/h2&gt;

&lt;p&gt;DriftQ started because I kept running into the same problem with AI and LLM pipelines. You've got these multi-step workflows (ingest, chunk, extract, summarize, classify, synthesize), and when something breaks near the end, the recovery plan in most setups is basically: run the whole thing again.&lt;/p&gt;

&lt;p&gt;In a normal backend, that's mostly friction: wasted engineer time, wasted compute, and more waiting around for work you've effectively already done. In AI pipelines, it becomes a direct cost problem. Every rerun means buying the same tokens you already paid for. And when you're iterating on prompts and the only thing that changed is the last step, rerunning five upstream LLM calls that haven't changed isn't just annoying, it's a fundamental cost leak.&lt;/p&gt;

&lt;p&gt;I wanted something that would give me the reliability basics (retries with backoff, dead-letter queue routing, idempotency, lease-based consumption) without asking me to deploy a cluster. And I wanted replay from a specific step so I could stop paying for work I'd already done.&lt;/p&gt;

&lt;p&gt;That's DriftQ. One binary. One process. No external dependencies. WAL-backed durability. A built-in dashboard. You can run it on a $5/month VPS, a local Docker host, or even a Raspberry Pi.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual differences, no fluff
&lt;/h2&gt;

&lt;p&gt;Here's how I think about it.&lt;/p&gt;

&lt;p&gt;Temporal is a &lt;strong&gt;durable execution platform&lt;/strong&gt;. You adopt its programming model. You write deterministic workflow code, separate your side effects into activities, deploy a multi-service cluster with a persistence database, and Temporal's event-sourced history drives everything forward. That's powerful. And it's a commitment. You're bound by the architecture.&lt;/p&gt;

&lt;p&gt;DriftQ is a &lt;strong&gt;reliability layer you drop in&lt;/strong&gt;. The stable v1 surface is a durable message broker: topics, consumer groups, streaming consumption over NDJSON, ack/nack with ownership, lease-based redelivery, retry policies carried with the message, strict DLQ routing, and consume-scope idempotency keys managed by the broker. The v2 layer adds replayable workflow foundations on top: an append-only run/event log, time-travel replay, and step-level artifact reuse.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If Temporal is "workflow engine first," DriftQ is "reliability plumbing first."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If Temporal is a durable distributed async call stack for business logic, DriftQ is a durable work-stream broker you can run as one process with retries, DLQ, and idempotency built in.&lt;/p&gt;
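
&lt;p&gt;To make "lease-based redelivery" concrete, here is a generic sketch of the pattern in Python. This is illustrative only, not DriftQ's actual API or data model: a consumer claims a message under a lease, and if the lease expires before an ack, the message becomes deliverable again.&lt;/p&gt;

```python
import time

def claim(queue, inflight, lease_seconds=30.0, now=None):
    """Lease sketch (illustrative, not DriftQ's API): move one message
    from the queue into the in-flight map with an expiry deadline."""
    now = time.monotonic() if now is None else now
    # Expired leases go back to the queue for redelivery.
    for msg_id in list(inflight):
        msg, deadline = inflight[msg_id]
        if now >= deadline:
            del inflight[msg_id]
            queue.append((msg_id, msg))
    if not queue:
        return None
    msg_id, msg = queue.pop(0)
    inflight[msg_id] = (msg, now + lease_seconds)
    return msg_id, msg

def ack(inflight, msg_id):
    """Acking removes the lease, so the message is never redelivered."""
    inflight.pop(msg_id, None)
```

&lt;p&gt;The point of the pattern: a crashed or stuck consumer cannot strand a message forever, and an acked message cannot be double-delivered after its lease is released.&lt;/p&gt;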

&lt;h2&gt;
  
  
  Replay means something different when tokens cost money
&lt;/h2&gt;

&lt;p&gt;This is the part that matters most for the AI and agent use case.&lt;/p&gt;

&lt;p&gt;In Temporal, replay is about &lt;strong&gt;correctness&lt;/strong&gt;. The event history lets Temporal recreate workflow state after a crash so your code can keep going. That's essential for long-running business processes.&lt;/p&gt;

&lt;p&gt;In DriftQ, &lt;strong&gt;replay is also about cost&lt;/strong&gt;. If you have a six-step LLM pipeline and you're tweaking the final synthesis prompt, you don't want to rerun the five upstream summarization calls that haven't changed. Those calls cost real money: input tokens, output tokens, and per-call API fees.&lt;/p&gt;

&lt;p&gt;With DriftQ's v2 replay, you can re-drive from the step that actually changed and reuse the intermediate outputs from earlier steps. In &lt;a href="https://dev.to/behnamaxo/how-to-cut-llm-waste-with-driftq-4g4o"&gt;How to Cut LLM Waste with DriftQ&lt;/a&gt;, I walked through a simple example where that's roughly &lt;strong&gt;66% cheaper&lt;/strong&gt; for the exact same workflow on the exact same model. The model didn't get cheaper. The workflow just stopped buying the same intermediate work over and over.&lt;/p&gt;

&lt;p&gt;At scale with dozens of workflows, hundreds of runs per day, rapid prompt iterations, flaky providers, and retry storms, that waste compounds &lt;em&gt;fast&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DriftQ is not
&lt;/h2&gt;

&lt;p&gt;I try to be upfront about this in the docs, and I'll say it here too.&lt;/p&gt;

&lt;p&gt;DriftQ-Core is not a distributed system. Not yet. It's a single process with file-based durability. If you need multi-node replication or horizontal scaling today, use Kafka and Temporal. I literally say that in the README.&lt;/p&gt;

&lt;p&gt;The v2 replayable workflow runtime is real and it works, but it's evolving. I call it "foundations" because that's what it is. The stable surface today is the v1 broker.&lt;/p&gt;

&lt;p&gt;The TypeScript SDK is stubbed. The Go and Python SDKs work for the v1 broker API. Governance, RBAC, and tenant isolation are on the roadmap, not shipped.&lt;/p&gt;

&lt;p&gt;DriftQ Cloud doesn't exist yet. It's planned.&lt;/p&gt;

&lt;p&gt;I'm one developer. Temporal has a company behind it with years of production hardening. That's just the reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to pick which
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pick Temporal when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need a mature, battle-tested workflow orchestration platform.&lt;/li&gt;
&lt;li&gt;You need multi-service scaling for millions of concurrent workflows.&lt;/li&gt;
&lt;li&gt;You need official SDKs across many languages with full support.&lt;/li&gt;
&lt;li&gt;You need well-defined security, multi-tenancy, and a managed cloud option.&lt;/li&gt;
&lt;li&gt;You're building long-running stateful business processes where correctness under partial failure is the whole point.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pick DriftQ when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want a single-binary deployment with no external dependencies for durable work delivery.&lt;/li&gt;
&lt;li&gt;Your core need is safe stream consumption (leases, ack/nack, bounded retries, DLQ, idempotency), not a full workflow orchestration platform.&lt;/li&gt;
&lt;li&gt;You're building AI or agent pipelines and you care about not re-paying for upstream work when only a downstream step changed.&lt;/li&gt;
&lt;li&gt;You want step-level replay and time-travel debugging as a first-class thing, and you're okay with that being described as evolving v2 foundations.&lt;/li&gt;
&lt;li&gt;Simplicity matters more to you than multi-node HA right now.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;p&gt;Temporal is the planet-scale, enterprise-ready choice. If you need what it offers, nothing I'm building replaces that.&lt;/p&gt;

&lt;p&gt;DriftQ is the leaner, AI-ready solution. It's for growing teams and smaller projects that need real durability guarantees without the operational tax of distributed infrastructure. It's for AI workloads where re-execution has a dollar cost and replay from a specific step actually saves you money.&lt;/p&gt;

&lt;p&gt;When you outgrow it, that's fine. That's the &lt;a href="https://drift-q.org/docs/roadmap/" rel="noopener noreferrer"&gt;plan&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/driftq-org/DriftQ-Core" rel="noopener noreferrer"&gt;github.com/driftq-org/DriftQ-Core&lt;/a&gt;&lt;/p&gt;

</description>
      <category>driftq</category>
      <category>temporal</category>
      <category>go</category>
    </item>
    <item>
      <title>How to Cut LLM Waste with DriftQ</title>
      <dc:creator>Behnam Amiri</dc:creator>
      <pubDate>Mon, 02 Mar 2026 17:45:48 +0000</pubDate>
      <link>https://forem.com/behnamaxo/how-to-cut-llm-waste-with-driftq-4g4o</link>
      <guid>https://forem.com/behnamaxo/how-to-cut-llm-waste-with-driftq-4g4o</guid>
      <description>&lt;p&gt;I have been part of teams where we tried to cut LLM costs the obvious ways: using a cheaper model, trimming prompts, capping output tokens, adding caching, maybe routing smaller tasks to a cheaper tier. All of that helps. But a lot of avoidable spend in production isn't really about model pricing. It's workflow waste. Not the kind you notice immediately, either. The sneaky kind:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Something fails near the end, so the whole workflow has to be rerun.&lt;/li&gt;
&lt;li&gt;A flaky provider causes retries that keep redoing the same paid work.&lt;/li&gt;
&lt;li&gt;A batch job pushes past safe concurrency and starts slamming the endpoint.&lt;/li&gt;
&lt;li&gt;A "self-healing" agent loop keeps spending in the background until somebody notices.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That wasted compute adds up fast. A lot of the time, you are not paying because the model is inherently too expensive. You are paying because your system keeps buying the same work over and over again. That is the layer DriftQ is meant to help with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DriftQ-Core&lt;/strong&gt; is an open-source Go project that gives you a durable broker plus replayable workflow runtime foundations in one package. If something fails late, you do not have to restart the whole workflow. If only one downstream step changed, you can replay from that step. If a dependency is flaky, retries are bounded. If concurrency goes sideways, there are controls for that too. DriftQ does not make tokens cheaper.&lt;/p&gt;

&lt;p&gt;What it can do is reduce avoidable spend caused by reruns, retries, and repeated execution of unchanged downstream work when prior outputs can be safely reused.&lt;/p&gt;

&lt;p&gt;If you are running anything more complex than a single prompt-response call (agents, multi-step chains, batch jobs, RAG pipelines, long-running workflows), that distinction matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The expensive part is usually not the model. It is the rerun.
&lt;/h2&gt;

&lt;p&gt;Let's use a concrete example. Assume an example model priced at &lt;strong&gt;$1.75 per 1M input tokens&lt;/strong&gt; and &lt;strong&gt;$14.00 per 1M output tokens&lt;/strong&gt;. Now imagine a daily AI news workflow with six LLM steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Summarize article 1&lt;/li&gt;
&lt;li&gt;Summarize article 2&lt;/li&gt;
&lt;li&gt;Summarize article 3&lt;/li&gt;
&lt;li&gt;Summarize article 4&lt;/li&gt;
&lt;li&gt;Summarize article 5&lt;/li&gt;
&lt;li&gt;Write a final report from all five summaries&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each summary step uses about &lt;strong&gt;5,000 input tokens&lt;/strong&gt; and returns &lt;strong&gt;800 output tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Per summary step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input cost: &lt;code&gt;5,000 / 1,000,000 x $1.75 = $0.00875&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Output cost: &lt;code&gt;800 / 1,000,000 x $14.00 = $0.01120&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Total per summary: &lt;strong&gt;$0.01995&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Five summaries cost about &lt;strong&gt;$0.09975&lt;/strong&gt;. Now the final report step uses &lt;strong&gt;6,000 input tokens&lt;/strong&gt; and returns &lt;strong&gt;2,000 output tokens&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input cost: &lt;code&gt;6,000 / 1,000,000 x $1.75 = $0.01050&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Output cost: &lt;code&gt;2,000 / 1,000,000 x $14.00 = $0.02800&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Final step total: &lt;strong&gt;$0.03850&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So one clean run costs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;$0.09975 + $0.03850 = $0.13825&lt;/strong&gt;&lt;/p&gt;
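
&lt;p&gt;The arithmetic is easy to check in a few lines of Python (the prices and token counts are this example's assumptions, not real list prices):&lt;/p&gt;

```python
# Assumed example rates: $1.75 per 1M input tokens, $14.00 per 1M output tokens.
PRICE_IN = 1.75 / 1_000_000
PRICE_OUT = 14.00 / 1_000_000

def step_cost(tokens_in, tokens_out):
    """Cost of one LLM call at the example rates."""
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

summary = step_cost(5_000, 800)    # one summarization step
final = step_cost(6_000, 2_000)    # the final report step
clean_run = 5 * summary + final    # five summaries plus the report

print(round(summary, 5))    # 0.01995
print(round(final, 5))      # 0.0385
print(round(clean_run, 5))  # 0.13825
```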

&lt;p&gt;That sounds cheap. And that is exactly why this kind of waste slips by people.&lt;/p&gt;

&lt;h2&gt;
  
  
  Without replay
&lt;/h2&gt;

&lt;p&gt;Let's say the final report is close, but not right. So you tweak the final prompt 10 times. In a naive workflow system, every tweak reruns all six steps. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 reruns x &lt;strong&gt;$0.13825&lt;/strong&gt; = &lt;strong&gt;$1.38250&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;plus the original run = &lt;strong&gt;$1.52075&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing changed in the first five steps. You still paid for them 11 times.&lt;/p&gt;

&lt;h2&gt;
  
  
  With replay
&lt;/h2&gt;

&lt;p&gt;After the first run, DriftQ can keep the run/event history plus intermediate outputs from earlier steps.&lt;/p&gt;

&lt;p&gt;So if all you changed was the final synthesis prompt, and the upstream outputs are still valid, you replay from the final step instead of replaying the whole workflow.&lt;/p&gt;
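
&lt;p&gt;The idea behind replay-from-a-step can be sketched in a few lines. This is a simplified illustration of the concept, not DriftQ's actual runtime: every step upstream of the replay point is served from stored artifacts, and only the replay point onward re-executes.&lt;/p&gt;

```python
def run_pipeline(steps, artifacts, replay_from=None):
    """Replay sketch (illustrative, not DriftQ's API): re-execute only
    from `replay_from` onward; earlier steps reuse stored artifacts."""
    executing = replay_from is None   # a fresh run executes everything
    last = None
    for name, fn in steps:
        if name == replay_from:
            executing = True
        if executing or name not in artifacts:
            artifacts[name] = fn(artifacts)   # the paid work happens here
        last = artifacts[name]
    return last
```

&lt;p&gt;In this sketch, replaying from the final step calls only the final step's function; everything upstream is served from &lt;code&gt;artifacts&lt;/code&gt; at zero marginal token cost.&lt;/p&gt;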

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 replays x &lt;strong&gt;$0.03850&lt;/strong&gt; = &lt;strong&gt;$0.38500&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;plus the original run = &lt;strong&gt;$0.52325&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is about &lt;strong&gt;66% cheaper&lt;/strong&gt; for the exact same workflow and the exact same model. The model did not get cheaper. The workflow just stopped buying the same intermediate work over and over again.&lt;/p&gt;
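
&lt;p&gt;The same comparison in Python, using the per-run numbers from the example above:&lt;/p&gt;

```python
# Per-run costs from the worked example above.
CLEAN_RUN = 0.13825    # one full six-step run
FINAL_STEP = 0.03850   # the final report step alone

naive = CLEAN_RUN + 10 * CLEAN_RUN          # original run + 10 full reruns
with_replay = CLEAN_RUN + 10 * FINAL_STEP   # original run + 10 final-step replays

print(round(naive, 5))        # 1.52075
print(round(with_replay, 5))  # 0.52325
print(round(100 * (1 - with_replay / naive)))  # 66 (percent saved)
```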

&lt;h2&gt;
  
  
  Why this gets ugly in the real world
&lt;/h2&gt;

&lt;p&gt;The example above is small on purpose. In real systems, the waste is usually worse. Think about the shape of a typical AI pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingest&lt;/li&gt;
&lt;li&gt;chunk&lt;/li&gt;
&lt;li&gt;extract&lt;/li&gt;
&lt;li&gt;summarize&lt;/li&gt;
&lt;li&gt;classify&lt;/li&gt;
&lt;li&gt;synthesize&lt;/li&gt;
&lt;li&gt;write results somewhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now imagine the last step fails because of a formatting bug, a provider hiccup, a timeout, or a downstream API issue. In a lot of systems, the recovery plan is basically one thing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run it all again.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That means paying again for every upstream LLM call even though nothing new was learned. And once you multiply that by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dozens of workflows&lt;/li&gt;
&lt;li&gt;hundreds of runs per day&lt;/li&gt;
&lt;li&gt;repeated prompt iteration&lt;/li&gt;
&lt;li&gt;flaky external dependencies&lt;/li&gt;
&lt;li&gt;weak retry discipline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;the waste compounds fast. That is why replay from the failure point is not just a debugging convenience in AI systems. It changes the economics of failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DriftQ actually gives you
&lt;/h2&gt;

&lt;p&gt;DriftQ is not trying to be magical. It gives you a set of practical primitives.&lt;/p&gt;

&lt;h3&gt;
  
  
  On the broker side
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;retries with backoff&lt;/li&gt;
&lt;li&gt;dead-letter queue routing&lt;/li&gt;
&lt;li&gt;idempotency keys&lt;/li&gt;
&lt;li&gt;consumer leases&lt;/li&gt;
&lt;li&gt;backpressure and max-inflight controls&lt;/li&gt;
&lt;li&gt;WAL-backed durability&lt;/li&gt;
&lt;/ul&gt;
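
&lt;p&gt;As a taste of what broker-managed idempotency buys you, here is the general pattern in Python (a generic sketch, not DriftQ's actual API or wire format): a message whose idempotency key has already been processed is skipped instead of handled twice.&lt;/p&gt;

```python
def consume(messages, processed_keys, handle):
    """Idempotent consumption sketch (illustrative, not DriftQ's API):
    deduplicate deliveries by idempotency key."""
    for msg in messages:
        key = msg.get("idempotency_key")
        if key is not None and key in processed_keys:
            continue    # duplicate delivery: already handled, skip it
        handle(msg)
        if key is not None:
            processed_keys.add(key)
```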

&lt;h3&gt;
  
  
  On the workflow side
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;an append-only run/event log&lt;/li&gt;
&lt;li&gt;time-travel replay&lt;/li&gt;
&lt;li&gt;artifact storage and reuse&lt;/li&gt;
&lt;li&gt;budget controls for tokens, dollars, attempts, and wall-clock time&lt;/li&gt;
&lt;li&gt;inspectable timelines for debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination matters because LLM cost problems usually do not come from one mistake. They come from a chain reaction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retry logic is sloppy&lt;/li&gt;
&lt;li&gt;replay is missing&lt;/li&gt;
&lt;li&gt;concurrency is too loose&lt;/li&gt;
&lt;li&gt;debugging means rerunning everything&lt;/li&gt;
&lt;li&gt;budgets exist only in somebody's head&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DriftQ gives you concrete controls for those failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;If you want to see what the repo actually offers today, this is the fastest way to do it.&lt;/p&gt;

&lt;p&gt;Important: the built-in demo in this repo is a minimal two-step workflow with nodes &lt;code&gt;A -&amp;gt; B&lt;/code&gt;. It proves replay, timelines, and artifact reuse, but it is not the same thing as the six-step AI-news example above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run DriftQ-Core
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="nt"&gt;-v&lt;/span&gt; driftq_data:/data ghcr.io/driftq-org/driftq-core:1.2.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Confirm the server is up
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://127.0.0.1:8080/v1/healthz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Build the CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go build &lt;span class="nt"&gt;-o&lt;/span&gt; driftqctl ./cmd/driftqctl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Run the demo workflow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./driftqctl &lt;span class="nt"&gt;--base-url&lt;/span&gt; http://127.0.0.1:8080 runs demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy the &lt;code&gt;run_id&lt;/code&gt; from that command's output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check run status
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./driftqctl &lt;span class="nt"&gt;--base-url&lt;/span&gt; http://127.0.0.1:8080 runs status &lt;span class="nt"&gt;--run-id&lt;/span&gt; &amp;lt;RUN_ID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Inspect the timeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./driftqctl &lt;span class="nt"&gt;--base-url&lt;/span&gt; http://127.0.0.1:8080 runs timeline &lt;span class="nt"&gt;--run-id&lt;/span&gt; &amp;lt;RUN_ID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Replay from the downstream demo step
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./driftqctl &lt;span class="nt"&gt;--base-url&lt;/span&gt; http://127.0.0.1:8080 runs replay &lt;span class="nt"&gt;--run-id&lt;/span&gt; &amp;lt;RUN_ID&amp;gt; &lt;span class="nt"&gt;--from-step&lt;/span&gt; B &lt;span class="nt"&gt;--mode&lt;/span&gt; time-travel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  View stored artifacts
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./driftqctl &lt;span class="nt"&gt;--base-url&lt;/span&gt; http://127.0.0.1:8080 runs artifacts &lt;span class="nt"&gt;--run-id&lt;/span&gt; &amp;lt;RUN_ID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That sequence is the point. In the built-in demo, replaying from &lt;code&gt;B&lt;/code&gt; lets you re-drive the downstream step without replaying &lt;code&gt;A&lt;/code&gt;. In a real AI workflow, the same idea applies to whatever your actual downstream node is called: reuse the outputs you already paid for and only re-execute the part that actually changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replay is the headline, but it is not the only savings
&lt;/h2&gt;

&lt;p&gt;Replay is the easiest benefit to explain, but it is not the only place the bill goes down.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Retry storms stop turning outages into invoices
&lt;/h3&gt;

&lt;p&gt;When a dependency starts failing, bad retry logic can get expensive very quickly.&lt;/p&gt;

&lt;p&gt;A lot of systems basically do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fail&lt;/li&gt;
&lt;li&gt;retry immediately&lt;/li&gt;
&lt;li&gt;fail again&lt;/li&gt;
&lt;li&gt;retry harder&lt;/li&gt;
&lt;li&gt;repeat until everybody is sad&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DriftQ gives you bounded attempts, backoff, and DLQ routing. So instead of "keep burning money until the provider recovers," you get "fail sanely, preserve state, and quarantine the bad work."&lt;/p&gt;
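
&lt;p&gt;The pattern looks roughly like this. It is a generic sketch of bounded retries with jittered backoff and DLQ routing, not DriftQ's actual API:&lt;/p&gt;

```python
import random
import time

def handle_with_retries(msg, handler, dlq, max_attempts=5, base_delay=0.5):
    """Bounded-retry sketch (illustrative, not DriftQ's API): exponential
    backoff with jitter, then dead-letter the message instead of
    retrying forever."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(msg)
        except Exception:
            if attempt == max_attempts:
                dlq.append(msg)   # quarantine the bad work
                return None
            delay = min(base_delay * 2 ** (attempt - 1), 30.0)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jittered backoff
```

&lt;p&gt;The cap on attempts is what turns "keep burning money until the provider recovers" into "fail sanely and quarantine."&lt;/p&gt;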

&lt;h3&gt;
  
  
  2. Concurrency mistakes stop multiplying damage
&lt;/h3&gt;

&lt;p&gt;One of the fastest ways to waste money is to fan out too aggressively, hit rate limits, and then combine successful requests, failed requests, and retries into one giant mess.&lt;/p&gt;

&lt;p&gt;DriftQ's backpressure and max-inflight controls help stop overload from turning into paid chaos.&lt;/p&gt;
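
&lt;p&gt;The core idea is just a hard cap on concurrent requests, something like this generic sketch (not DriftQ's actual API):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def bounded_fan_out(items, call, max_inflight=4):
    """Max-inflight sketch (illustrative, not DriftQ's API): never have
    more than `max_inflight` requests outstanding at once, so a batch
    job cannot slam a rate-limited endpoint."""
    with ThreadPoolExecutor(max_workers=max_inflight) as pool:
        # map preserves input order even though calls run concurrently
        return list(pool.map(call, items))
```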

&lt;h3&gt;
  
  
  3. Runaway agents get budget fences
&lt;/h3&gt;

&lt;p&gt;If you have ever had an agent loop longer than expected, you know how ugly this can get.&lt;/p&gt;

&lt;p&gt;DriftQ includes budget controls for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;max attempts&lt;/li&gt;
&lt;li&gt;token budgets&lt;/li&gt;
&lt;li&gt;dollar budgets&lt;/li&gt;
&lt;li&gt;wall-clock time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So when something goes off the rails, the system can stop itself before your monthly bill makes the decision for you.&lt;/p&gt;
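
&lt;p&gt;Conceptually, a budget fence is just a guard that every iteration of the loop has to pass. A minimal sketch of the idea (illustrative, not DriftQ's actual API):&lt;/p&gt;

```python
import time

class BudgetGuard:
    """Budget-fence sketch (illustrative, not DriftQ's API): stop a loop
    once any budget is exhausted: attempts, tokens, dollars, or
    wall-clock seconds."""
    def __init__(self, max_attempts, max_tokens, max_dollars, max_seconds):
        self.max_attempts = max_attempts
        self.max_tokens = max_tokens
        self.max_dollars = max_dollars
        self.max_seconds = max_seconds
        self.attempts = 0
        self.tokens = 0
        self.dollars = 0.0
        self.started = time.monotonic()

    def charge(self, tokens, dollars):
        """Record one attempt's spend."""
        self.attempts += 1
        self.tokens += tokens
        self.dollars += dollars

    def exhausted(self):
        """True once any one budget is spent."""
        return (self.attempts >= self.max_attempts
                or self.tokens >= self.max_tokens
                or self.dollars >= self.max_dollars
                or time.monotonic() - self.started >= self.max_seconds)
```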

&lt;h2&gt;
  
  
  The honest caveats
&lt;/h2&gt;

&lt;p&gt;DriftQ is not a magic trick. It will &lt;strong&gt;not&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shorten your prompts for you&lt;/li&gt;
&lt;li&gt;improve bad prompts automatically&lt;/li&gt;
&lt;li&gt;make a weak model smarter&lt;/li&gt;
&lt;li&gt;turn a bad agent design into a good one&lt;/li&gt;
&lt;li&gt;replace careful application design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it is also &lt;strong&gt;not&lt;/strong&gt; pretending to be Kafka at massive distributed scale. What it does is narrower, and for a lot of teams, more useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;durable execution&lt;/li&gt;
&lt;li&gt;replay&lt;/li&gt;
&lt;li&gt;artifact reuse&lt;/li&gt;
&lt;li&gt;disciplined retries&lt;/li&gt;
&lt;li&gt;budget controls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means fewer avoidable reruns, less accidental spend, and much better visibility into where your AI workflow is actually wasting money.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who should care about this right now
&lt;/h2&gt;

&lt;p&gt;If any of these sound familiar, this is probably worth a look:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Why did this workflow restart from step 1 again?"&lt;/li&gt;
&lt;li&gt;"Why are we calling the model when nothing changed?"&lt;/li&gt;
&lt;li&gt;"Why did the batch worker hammer the provider like that?"&lt;/li&gt;
&lt;li&gt;"Why did this agent keep spending money all night?"&lt;/li&gt;
&lt;li&gt;"Why does debugging this workflow cost money every single time?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is exactly the class of problem DriftQ is built for.&lt;/p&gt;

&lt;p&gt;It is one Go binary with file-backed durability. No extra fleet, no giant platform tax, no stack of dependencies just to get reliable workflows, replay, retries, and control.&lt;/p&gt;

&lt;p&gt;For small teams, startups, and solo builders building AI systems, that tradeoff makes a lot of sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;I think a lot of teams are trying to cut LLM costs at the wrong layer. Yes, model pricing matters. But if your workflow keeps rerunning expensive work, handling retries badly, or letting agents wander without guardrails, you are going to burn money no matter how much you trim prompts. &lt;strong&gt;DriftQ&lt;/strong&gt; goes after that waste layer, and in a lot of real production systems, that's where the biggest savings are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/driftq-org/DriftQ-Core" rel="noopener noreferrer"&gt;github.com/driftq-org/DriftQ-Core&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>driftq</category>
      <category>ai</category>
      <category>go</category>
    </item>
  </channel>
</rss>
