Forem: Shivam Saluja

Sync-over-Async: Bypassing Azure Service Bus Session Limits for AI Workloads

Shivam Saluja — Wed, 08 Apr 2026 09:58:29 +0000

How to bridge legacy HTTP clients to long-running AI tasks without 504 Timeouts or Stateful Bottlenecks.

The business wants you to integrate a new LLM feature. You wire up a standard REST endpoint, deploy it, and it works flawlessly in testing. Then it hits production. The AI takes 45 seconds to generate a response during peak load. Your API Gateway drops the connection at 30 seconds. The client gets a 504 Gateway Timeout, the user furiously clicks retry, and suddenly you have a thundering herd that takes down your entire connection pool.

Welcome to the era of AI workloads on legacy HTTP infrastructure.

Standard REST APIs are built for speed. AI workloads are fundamentally slow. If you do not decouple them, your architecture will eventually shatter under the weight of holding thousands of long-running HTTP threads open.

The "Anti-Pattern" Lifeline: Sync-over-Async

In a perfect world, your clients would be fully event-driven, communicating over WebSockets or Server-Sent Events. In the real world, you have legacy mobile apps, older frontends, and strict partner webhooks that only speak one language: they send an HTTP POST and they expect a 200 OK with a JSON payload immediately. You cannot force them to implement an Azure Service Bus listener.

This is where the Sync-over-Async Gateway comes in.

It is an edge integration pattern where a Gateway receives a synchronous HTTP request, converts it into an asynchronous message on a broker (like Azure Service Bus), waits for the backend worker to process it, and then maps the reply back to the original HTTP connection.
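
To make the "maps the reply back" step concrete, here is a minimal sketch of the bookkeeping a Gateway needs: a map from correlation ID to a waiting future. It uses only the JDK, and the class and method names (PendingReplies, register, complete) are hypothetical; this is not Sentinel's implementation.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: tracks HTTP requests that are waiting for a broker reply. */
final class PendingReplies {
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    /** Registers a waiter; call this before the request message is sent to the broker. */
    CompletableFuture<String> register(String correlationId, Duration timeout) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(correlationId, future);
        // Fail the waiter if no reply arrives in time, and always drop the map entry.
        future.orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS)
              .whenComplete((body, error) -> pending.remove(correlationId));
        return future;
    }

    /** Called by the reply listener; returns false for late or unknown replies. */
    boolean complete(String correlationId, String body) {
        CompletableFuture<String> future = pending.remove(correlationId);
        return future != null && future.complete(body);
    }

    /** Number of requests still waiting for a reply. */
    int size() {
        return pending.size();
    }
}
```

The HTTP handler would return the future from register, and the broker listener would call complete when a reply arrives. The timeout matters: without it, a lost reply would keep the client connection and the map entry alive indefinitely.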

Azure Service Bus Sessions

When engineers build this on Azure, the immediate instinct is to use Service Bus Sessions.

The Gateway sends a message with SessionId = 123.
The Gateway blocks and listens to a reply queue exclusively for SessionId = 123.
The Worker processes the task and sends the reply with SessionId = 123.

This works beautifully on a single machine. At scale, it could be a disaster.

If you have 50 Gateway instances behind a load balancer, how does the reply get back to the exact instance holding the open HTTP connection? If you use Sessions, your system becomes deeply stateful. Instance #1 has to explicitly request the lock for Session 123. If Instance #1 crashes, that session is locked until it times out. Furthermore, Azure Service Bus Standard tier enforces hard limits on concurrent sessions, meaning a traffic spike will instantly exhaust your namespace.

Sessions force you to manage stateful routing across a distributed cluster, which breaks horizontal elasticity.

The Fix: Stateless Filtered Topics

To achieve true horizontal scale, the Gateway layer must be 100% stateless. Instead of using locked sessions, we can push the routing logic down to the broker using a Filtered Topic Pattern.

Explicit Addressing: The Gateway injects a unique ReplyToInstance property into the request (e.g., Instance-A).
Dynamic Subscriptions: On startup, each Gateway creates a lightweight, temporary subscription on a global reply topic with a SQL rule: ReplyToInstance = 'Instance-A'.
Broker-Side Routing: When the backend worker finishes, it attaches the same property to the reply. The Azure broker evaluates the SQL filter and pushes the message only to the specific Gateway pod waiting for it.

No session locks. No implicit instance affinity. Complete horizontal scalability.
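
The three steps above can be modelled in a few lines of plain Java. This is an in-memory stand-in for the broker, not Azure SDK code; FilteredReplyTopic and its methods are hypothetical names, and the filter is reduced to a single equality check on ReplyToInstance.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** In-memory model of a reply topic with one equality filter per subscription. */
final class FilteredReplyTopic {
    // subscription name -> value that ReplyToInstance must equal
    private final Map<String, String> filters = new ConcurrentHashMap<>();
    // subscription name -> reply bodies delivered to that subscription
    private final Map<String, List<String>> inboxes = new ConcurrentHashMap<>();

    /** Models creating a subscription with the rule ReplyToInstance = '<instanceId>'. */
    void subscribe(String subscriptionName, String instanceId) {
        inboxes.put(subscriptionName, new CopyOnWriteArrayList<>());
        filters.put(subscriptionName, instanceId);
    }

    /** Copies the reply to every subscription whose filter matches; returns the match count. */
    int publish(String body, Map<String, String> properties) {
        String target = properties.get("ReplyToInstance");
        int delivered = 0;
        for (Map.Entry<String, String> filter : filters.entrySet()) {
            if (filter.getValue().equals(target)) {
                inboxes.get(filter.getKey()).add(body);
                delivered++;
            }
        }
        return delivered;
    }

    /** Snapshot of what a subscription has received so far. */
    List<String> inbox(String subscriptionName) {
        return List.copyOf(inboxes.getOrDefault(subscriptionName, List.of()));
    }
}
```

The point of the model is that the publisher never names a subscription. It only sets a property, and the topic decides who receives the copy, which is why the workers need no knowledge of how many Gateway instances exist.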

Breaking Down the Stateless Architecture

If you look at the architecture diagram above, here is exactly how we eliminate the Session bottleneck and scale the Gateway layer horizontally:

1. The Synchronous Edge (Left Side)
The client sends a standard, blocking HTTP REST request. Our Load Balancer distributes this to any available Gateway Replica (e.g., Replica 1). Because our Gateway is completely stateless, the load balancer doesn't need to worry about sticky sessions.

2. The Asynchronous Handoff (Middle)
Replica 1 takes the HTTP payload and publishes it to the Azure Service Bus request queue.
Crucially, it does NOT open a Service Bus Session. Instead, it generates a unique CorrelationId (e.g., replica1_reqA) and includes it in the message properties. Immediately, Replica 1 spins up a lightweight, dynamic subscription on the Reply Topic with a strict SQL Filter: CorrelationId = 'replica1_reqA'.

3. The AI Worker Layer (Right Side)
Your long-running AI workers operate as standard, competing consumers. A worker pulls the request from the request queue, processes the heavy LLM prompt for 45 seconds, and generates the result. To send the result back, the worker simply attaches that exact same CorrelationId to the response message and drops it onto the global Reply Topic.

4. Broker-Side Routing (The Magic)
This is where the architecture shines. The Gateway instances are not actively polling or fighting over locked sessions. The Azure Service Bus broker evaluates the incoming reply message, reads CorrelationId = 'replica1_reqA', matches it to Replica 1's dynamic SQL filter, and pushes the message directly down that specific pipe.

Replica 1 receives the answer, maps it back to the open HTTP thread, and returns the 200 OK to the client. If Replica 1 had crashed during those 45 seconds, its temporary subscription would simply be cleaned up by the broker once idle (provided it was created with an idle-expiry setting such as AutoDeleteOnIdle), leaving no locked sessions, no frozen resources, and no blocked queues.
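
Here is a small in-memory walk-through of those four steps, including the crash case. It is a sketch under simplifying assumptions: the queue and topic are plain Java collections, the "worker" just echoes the prompt, and every name in it (RoundTripSketch, Replica, workOnce) is hypothetical rather than part of Sentinel or the Azure SDK.

```java
import java.util.Map;
import java.util.Queue;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

/** In-memory walk-through of the four steps. Every name here is hypothetical. */
final class RoundTripSketch {
    /** Minimal message: a body plus the two routing values. */
    static final class Message {
        final String body;
        final String correlationId;
        final String replyTo;

        Message(String body, String correlationId, String replyTo) {
            this.body = body;
            this.correlationId = correlationId;
            this.replyTo = replyTo;
        }
    }

    // Stand-in for the request queue that the workers compete on.
    private final Queue<Message> requests = new ConcurrentLinkedQueue<>();
    // Stand-in for the reply topic: one filtered subscription per live replica.
    private final Map<String, Replica> subscriptions = new ConcurrentHashMap<>();

    /** One Gateway replica holding the futures for its own open HTTP requests. */
    final class Replica {
        private final String id;
        private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

        Replica(String id) {
            this.id = id;
            subscriptions.put(id, this); // step 2: subscribe with a filter on this replica's id
        }

        CompletableFuture<String> sendAndReceive(String prompt) {
            String correlationId = id + "_" + UUID.randomUUID();
            CompletableFuture<String> future = new CompletableFuture<>();
            pending.put(correlationId, future); // register first so a fast reply is never missed
            requests.add(new Message(prompt, correlationId, id));
            return future;
        }

        void crash() {
            subscriptions.remove(id); // models the replica's subscription being cleaned up
        }
    }

    Replica start(String id) {
        return new Replica(id);
    }

    /** Steps 3 and 4: a worker handles one request and the "broker" routes the reply. */
    boolean workOnce() {
        Message request = requests.poll();
        if (request == null) return false;
        Replica target = subscriptions.get(request.replyTo); // filter match on the reply
        if (target == null) return false; // no matching subscription: the reply is dropped
        CompletableFuture<String> waiter = target.pending.remove(request.correlationId);
        return waiter != null && waiter.complete("echo: " + request.body);
    }
}
```

Note what happens to a reply addressed to a crashed replica: nothing matches, so it is dropped and no other replica is disturbed. The client of the crashed replica still loses its connection, so retries on the client side remain necessary.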

Introducing Sentinel: The Open-Source Starter

Implementing dynamic Service Bus Administration clients, processor lifecycles, and thread management is complex. To solve this, I built Sentinel—an open-source Spring Boot starter that abstracts this entire pattern into a single library dependency.

Here is how you can completely decouple your HTTP APIs from your slow AI workers in just a few lines of code.

1. Add the Dependency

<dependency>
    <groupId>io.github.shivamsaluja</groupId>
    <artifactId>sentinel-servicebus-starter</artifactId>
    <version>1.0.0</version>
</dependency>

2. The Zero-Boilerplate Configuration
Sentinel handles all the Azure SDK heavy lifting. Just point it to your request queue and reply topic in application.yml:

sentinel:
  servicebus:
    connection-string: "Endpoint=sb://your-namespace.servicebus.windows.net/;SharedAccessKeyName=...;"
    request-queue: "ai-task-requests"
    reply-topic: "ai-task-replies"

3. The Gateway Controller (The Magic)
By returning a CompletableFuture, we instantly free up the Tomcat HTTP thread. The client's connection remains open, but the server resources are released, allowing massive concurrency.

@RestController
@RequestMapping("/api/v1/ai")
public class GatewayController {

    private final SentinelTemplate sentinelTemplate;

    public GatewayController(SentinelTemplate sentinelTemplate) {
        this.sentinelTemplate = sentinelTemplate;
    }

    @PostMapping("/generate")
    public CompletableFuture<ResponseEntity<String>> generateReport(@RequestBody String prompt) {

        // 1. Send the prompt to the Service Bus and wait.
        // Under the hood, Sentinel manages the dynamic SQL subscription.
        CompletableFuture<String> asyncReply = sentinelTemplate.sendAndReceive(prompt);

        // 2. Map the asynchronous reply back to a standard HTTP 200 OK.
        return asyncReply.thenApply(reply -> {
            return ResponseEntity.ok(reply);
        }).exceptionally(ex -> {
            return ResponseEntity.internalServerError().body("Task failed: " + ex.getMessage());
        });
    }
}
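
One refinement worth considering: the controller above turns every failure into a 500. If the wait itself is bounded, a timeout can be reported as a 504 instead, which tells the client the worker was slow rather than broken. The sketch below uses only the JDK and returns bare status codes; ReplyStatus and its methods are hypothetical names, and whether Sentinel applies its own timeout is not something this example assumes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper that turns the outcome of a reply future into an HTTP status code. */
final class ReplyStatus {
    /** Unwraps CompletionException so the real cause can be inspected. */
    static int statusFor(Throwable error) {
        Throwable cause = (error instanceof CompletionException && error.getCause() != null)
                ? error.getCause()
                : error;
        // 504 signals that the upstream worker was too slow; everything else is a 500.
        return (cause instanceof TimeoutException) ? 504 : 500;
    }

    /**
     * Bounds the wait and maps success to 200, failure to 504 or 500.
     * Note that orTimeout completes the passed-in future exceptionally when time runs out.
     */
    static CompletableFuture<Integer> toStatus(CompletableFuture<String> reply, long timeoutMillis) {
        return reply.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                    .thenApply(body -> 200)
                    .exceptionally(ReplyStatus::statusFor);
    }
}
```

In a Spring controller the same mapping would sit inside the exceptionally callback, choosing between a gateway-timeout and an internal-server-error response.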

4. The Backend Worker Contract
Your backend workers remain standard, dumb, asynchronous consumers. They just need to respect the routing contract by passing the properties back.

private void processAIRequest(ServiceBusReceivedMessageContext context) {
    ServiceBusReceivedMessage request = context.getMessage();

    // Extract the routing property injected by the Sentinel Gateway
    String replyToInstance = (String) request.getApplicationProperties().get("ReplyToInstance");

    // ... (Simulate slow AI processing taking 45 seconds) ...
    String aiResponse = "Generated Report Data...";

    ServiceBusMessage replyMessage = new ServiceBusMessage(aiResponse);
    replyMessage.setCorrelationId(request.getCorrelationId());

    // CRITICAL: Attach the routing property so Azure knows which pod gets the reply
    if (replyToInstance != null) {
        replyMessage.getApplicationProperties().put("ReplyToInstance", replyToInstance);
    }

    // senderClient is assumed to be a ServiceBusSenderClient for the reply topic, created elsewhere
    senderClient.sendMessage(replyMessage);
}

The Result

By dropping the Session requirement, your API Gateway layer becomes horizontally scalable. You can deploy 10 pods or 1,000 pods, bounded only by the broker's per-topic subscription quota. The Azure Service Bus handles all the complex routing logic on the broker side, and your legacy clients get their synchronous 200 OK, no matter how long the AI takes to think.

If you are dealing with timeout issues or brittle edge-integration architectures, check out the project on GitHub.

🔗 Sentinel Service Bus Starter on GitHub

I would love to hear your thoughts, feedback, or horror stories about managing Service Bus sessions in the comments below!

Bypassing Azure Service Bus Session Limits: A Sync-over-Async Pattern for Spring Boot

Shivam Saluja — Thu, 12 Mar 2026 10:14:39 +0000

If you have spent a decade building large-scale backend systems, you know that integrating modern, slow-running workloads—like LLM prompts or complex AI tasks—into legacy synchronous architectures is a massive headache.

Standard HTTP REST calls are inherently brittle for this. If an AI model takes 45 seconds to generate a response, your traditional API gateway or HTTP client will likely time out at the 30-second mark. The connection drops, the user gets a 504 Gateway Timeout, and the backend CPU cycles are completely wasted.

The textbook architectural answer is to introduce a message broker to act as a shock absorber. But what if your client-facing frontend requires a synchronous, Request-Reply experience?

You have to build a "Sync-over-Async" bridge. And if you are using Azure Service Bus, doing this at a massive scale exposes a critical bottleneck.

The Problem with Service Bus Sessions

When implementing a Request-Reply pattern on Azure Service Bus, the default recommendation is to use Sessions. You send a message with a specific SessionId, and your consumer locks onto that session to receive the reply.

This approach works beautifully in small systems, but it fails spectacularly at scale for two reasons:

The "Sticky" Bottleneck: Sessions create exclusive locks. If one session has 1,000 messages and another has 10, a consumer gets stuck on the heavy session while other pods sit idle.
Hard Limits: On the Standard tier, you are limited to 1,500 concurrent sessions. If you are scaling to hundreds or thousands of Spring Boot replicas during a massive traffic spike, you will hit a wall.

If you try to bypass sessions by having thousands of replicas listen to a single shared reply queue, you create a "competing consumer" disaster, wasting CPU cycles and thrashing the broker.

The Enterprise Solution: The Filtered Topic Pattern

To build a highly scalable, session-less Request-Reply architecture, we need to shift from Queues to Topics with SQL Filters. This is the core engine of an AI-Native Gateway concept designed to modernize legacy software systems without rewriting the clients.

Here is how the architecture flows:

The Request: The Spring Boot application generates a unique InstanceId on startup. It sends the request to a standard queue, attaching a custom property: ReplyToInstance = 'Instance-123'.
The Dynamic Subscription: When the pod boots up, it dynamically provisions a lightweight Subscription to a global reply-topic.
The Magic (SQL Filter): We apply a SqlRuleFilter to that subscription: ReplyToInstance = 'Instance-123'.

By leveraging the broker's data plane to evaluate the SQL filter, Azure Service Bus does the heavy lifting. Pod #123 only receives messages destined for Pod #123. There is zero thrashing, no session limits, and you get pure horizontal elasticity.
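
Since the filter is just a SQL-style expression string, it is worth building it carefully. The sketch below is a hypothetical helper (ReplyFilters is not part of Sentinel or the Azure SDK) that produces the expression shown above and doubles any embedded single quotes, which is how string literals are escaped in Service Bus SQL filters.

```java
/** Hypothetical helper for building the subscription's SQL filter expression. */
final class ReplyFilters {
    /** Returns e.g. ReplyToInstance = 'Instance-123', doubling any embedded single quotes. */
    static String forInstance(String instanceId) {
        if (instanceId == null || instanceId.isBlank()) {
            throw new IllegalArgumentException("instanceId must not be blank");
        }
        // Escaping keeps an unusual id from breaking or altering the filter expression.
        return "ReplyToInstance = '" + instanceId.replace("'", "''") + "'";
    }
}
```

The resulting string is what you would hand to the SDK's SqlRuleFilter when provisioning the subscription. Generating the InstanceId yourself (for example from a UUID) keeps quotes out of it in the first place, so the escaping is only a safety net.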

Achieving True Horizontal Scaling with an HTTP Load Balancer

This architecture is not just powerful for one gateway instance; it is designed for massive scale. You can have 50 or 100 Gateway pods sitting behind a load balancer to handle peak traffic.

To do this, you place a standard HTTP Load Balancer (like Azure Application Gateway or Nginx) in front of your Sentinel Gateway instances.

The Load Balancer's role is crucial:

Even Traffic Distribution: Configure the load balancer with a "Round-Robin" or "Least Connections" algorithm. This ensures incoming HTTP requests are sprayed evenly across all available Gateway pods (e.g., Gateway-A, Gateway-B, etc.).
Preventing "Sticky" Bottlenecks: This is critical. You must disable HTTP Session Affinity (Sticky Sessions) on the Load Balancer. Every single request should be routed independently. Because each Gateway instance operates on a strict 1:1 ratio (generating a unique CorrelationId and waiting for exactly one reply), instances don't need to share state. An even distribution of HTTP traffic naturally leads to an even distribution of Service Bus messages and replies.

This creates a stateless, highly resilient design. If one Gateway instance crashes, the load balancer simply sends the next request to another instance, and the overall system keeps humming.

Introducing the Sentinel Service Bus Starter

Wiring up the Azure Administration Client to dynamically provision and clean up these filtered subscriptions—while managing reactive CompletableFuture mappings—is a lot of boilerplate.

To solve this, I built the Sentinel Service Bus Starter, a plug-and-play Spring Boot library that abstracts this entire pattern into a single dependency.

How it works:

Just drop the dependency into your build.gradle, provide your connection string in application.yml, and inject the SentinelTemplate:

@RestController
@RequestMapping("/api/v1/gateway")
public class GatewayController {

    private final SentinelTemplate sentinelTemplate;

    public GatewayController(SentinelTemplate sentinelTemplate) {
        this.sentinelTemplate = sentinelTemplate;
    }

    @PostMapping("/process")
    public CompletableFuture<ResponseEntity<String>> processRequest(@RequestBody String payload) {
        // Sends to the ASB Queue, waits on the dynamic Topic Subscription
        return sentinelTemplate.sendAndReceive(payload)
            .thenApply(ResponseEntity::ok)
            .exceptionally(ex -> ResponseEntity.internalServerError().build());
    }
}

Because it leverages Java 21's Virtual Threads (Project Loom) under the hood, Tomcat HTTP threads are never blocked while waiting for the Service Bus round-trip, allowing incredible throughput even when waiting 60 seconds for an AI workload to finish.

Bridging the Legacy Gap

We don't always have the luxury of migrating our entire ecosystem to Event-Driven Architecture overnight. Sometimes, you just need a bulletproof, highly scalable Gateway to protect your modern backends from synchronous legacy clients.

I’d love to hear how other teams are tackling the Sync-over-Async problem in the comments!