<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ismail C</title>
    <description>The latest articles on Forem by Ismail C (@newssourcecrawler).</description>
    <link>https://forem.com/newssourcecrawler</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3802524%2F31aa372e-0390-4406-a2c6-e2bc02793c29.jpeg</url>
      <title>Forem: Ismail C</title>
      <link>https://forem.com/newssourcecrawler</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/newssourcecrawler"/>
    <language>en</language>
    <item>
      <title>Stop Queuing Inference Requests</title>
      <dc:creator>Ismail C</dc:creator>
      <pubDate>Mon, 02 Mar 2026 22:21:43 +0000</pubDate>
      <link>https://forem.com/newssourcecrawler/stop-queuing-inference-requests-11cj</link>
      <guid>https://forem.com/newssourcecrawler/stop-queuing-inference-requests-11cj</guid>
      <description>&lt;p&gt;Most inference backends degrade under burst.&lt;/p&gt;

&lt;p&gt;This is not specific to LLMs.&lt;br&gt;
It applies to any constrained compute system:&lt;br&gt;
    • a single GPU&lt;br&gt;
    • a local model runner&lt;br&gt;
    • a CPU-bound worker&lt;br&gt;
    • a tightly sized inference fleet&lt;/p&gt;

&lt;p&gt;When demand spikes, most systems do one of two things:&lt;br&gt;
    1.  Accept everything and let requests accumulate internally.&lt;br&gt;
    2.  Rate-limit arrival at the edge.&lt;/p&gt;

&lt;p&gt;Both approaches hide the real problem.&lt;/p&gt;

&lt;p&gt;Queues grow.&lt;br&gt;
Latency stretches.&lt;br&gt;
Retries amplify pressure.&lt;br&gt;
Memory usage becomes unpredictable.&lt;br&gt;
Overload turns opaque.&lt;/p&gt;

&lt;p&gt;You don’t see failure immediately.&lt;br&gt;
You see slow decay.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The Missing Boundary&lt;/p&gt;

&lt;p&gt;There’s a difference between rate limiting and execution governance.&lt;/p&gt;

&lt;p&gt;Rate limiting controls how fast requests arrive.&lt;br&gt;
Execution governance controls how many requests are allowed to run.&lt;/p&gt;

&lt;p&gt;Those are not the same.&lt;/p&gt;

&lt;p&gt;You can rate-limit and still build an unbounded internal queue.&lt;/p&gt;

&lt;p&gt;If you don’t enforce a hard cap on concurrent execution, the backend becomes the queue.&lt;/p&gt;

&lt;p&gt;And queues under burst are silent liabilities.&lt;/p&gt;
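&lt;p&gt;A toy arithmetic sketch (hypothetical rates, not from any real system) makes the gap concrete: an edge rate limiter happily admits 10 requests per second while the backend completes only 1 per second, so the admitted-but-unfinished work piles up inside the backend even though the limiter never fires.&lt;/p&gt;

```python
# Discrete-time sketch with hypothetical numbers: the rate limiter is
# satisfied (only ADMIT_PER_SEC requests arrive each second), yet the
# backend accumulates everything it has admitted but not finished.

ADMIT_PER_SEC = 10    # what the edge rate limiter allows through
FINISH_PER_SEC = 1    # what the backend can actually complete

backlog = 0
backlog_over_time = []
for second in range(30):
    backlog += ADMIT_PER_SEC                # admitted this second
    backlog -= min(backlog, FINISH_PER_SEC) # completed this second
    backlog_over_time.append(backlog)

print(backlog_over_time[-1])  # 270: the backend has become the queue
```

&lt;p&gt;Thirty seconds of "correctly" rate-limited traffic leaves 270 requests buffered in the backend, and nothing at the edge ever signaled a problem.&lt;/p&gt;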

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;A Different Approach: Explicit Yield&lt;/p&gt;

&lt;p&gt;Instead of buffering overload, convert it into an explicit response.&lt;/p&gt;

&lt;p&gt;When capacity is full:&lt;br&gt;
    • Do not queue.&lt;br&gt;
    • Do not block.&lt;br&gt;
    • Do not defer silently.&lt;/p&gt;

&lt;p&gt;Return:&lt;/p&gt;

&lt;p&gt;status = yield&lt;br&gt;
retry_hint_ms = &lt;/p&gt;

&lt;p&gt;The system remains bounded.&lt;br&gt;
The client decides when to retry.&lt;/p&gt;

&lt;p&gt;Overload becomes explicit instead of hidden.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What This Looks Like&lt;/p&gt;

&lt;p&gt;Here’s a simple test:&lt;br&gt;
    • max_inflight = 1&lt;br&gt;
    • 20 concurrent clients&lt;br&gt;
    • backend execution time = 10 seconds&lt;/p&gt;

&lt;p&gt;Observed state transitions:&lt;/p&gt;

&lt;p&gt;t=44  inflight=1  executed_total=1  yielded_total=19&lt;br&gt;
t=79  inflight=0  executed_total=1  yielded_total=19&lt;/p&gt;

&lt;p&gt;Interpretation:&lt;br&gt;
    • Inflight never exceeded 1.&lt;br&gt;
    • One request executed.&lt;br&gt;
    • Nineteen yielded immediately.&lt;br&gt;
    • No queue growth.&lt;/p&gt;

&lt;p&gt;The system did not degrade.&lt;br&gt;
It remained bounded.&lt;/p&gt;
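&lt;p&gt;The experiment above can be reproduced in a few lines. This sketch scales the 10-second backend call down to half a second so it runs quickly, but the shape is identical: one slot, twenty clients, one execution, nineteen yields:&lt;/p&gt;

```python
import threading
import time

MAX_INFLIGHT = 1
_slots = threading.BoundedSemaphore(MAX_INFLIGHT)
executed_total = 0
yielded_total = 0
_counters = threading.Lock()

def backend():
    time.sleep(0.5)   # stands in for the 10 s execution in the experiment

def client():
    global executed_total, yielded_total
    if _slots.acquire(blocking=False):   # got the single slot
        try:
            backend()
            with _counters:
                executed_total += 1
        finally:
            _slots.release()
    else:                                # slot taken: yield immediately
        with _counters:
            yielded_total += 1

threads = [threading.Thread(target=client) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(executed_total, yielded_total)  # 1 19 -- inflight never exceeded 1
```

&lt;p&gt;All twenty clients start well inside the 0.5 s execution window, so exactly one acquires the slot and the other nineteen get an immediate answer rather than a place in line.&lt;/p&gt;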

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why This Matters for Inference Systems&lt;/p&gt;

&lt;p&gt;Inference workloads are bursty.&lt;/p&gt;

&lt;p&gt;Prompts don’t arrive in smooth curves.&lt;br&gt;
They arrive in clusters:&lt;br&gt;
    • user refresh storms&lt;br&gt;
    • retry loops&lt;br&gt;
    • concurrent UI events&lt;br&gt;
    • load balancer reshuffles&lt;br&gt;
    • autoscaler lag&lt;/p&gt;

&lt;p&gt;If your backend silently buffers that burst,&lt;br&gt;
you inherit the tail latency and memory consequences later.&lt;/p&gt;

&lt;p&gt;If you bound execution and yield instead,&lt;br&gt;
you trade implicit instability for explicit backpressure.&lt;/p&gt;

&lt;p&gt;That trade is almost always worth it.&lt;/p&gt;
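&lt;p&gt;On the client side, explicit backpressure is cheap to honor. Here is a sketch of a retry loop that respects retry_hint_ms and adds jitter so a herd of yielded clients does not re-burst in lockstep; the function name and defaults are hypothetical:&lt;/p&gt;

```python
import random
import time

def call_with_backoff(send, max_attempts=5):
    """Retry on yield, honoring the server's retry hint plus random jitter."""
    for attempt in range(max_attempts):
        response = send()
        if response["status"] != "yield":
            return response                       # executed (or hard error)
        hint_ms = response.get("retry_hint_ms", 100)
        # Sleep between 1x and 2x the hint so retries spread out in time.
        time.sleep((hint_ms / 1000.0) * (1.0 + random.random()))
    return {"status": "gave_up"}
```

&lt;p&gt;Without the jitter, every client that yielded at t=0 would retry at exactly t=retry_hint_ms and recreate the original burst.&lt;/p&gt;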

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What This Is Not&lt;/p&gt;

&lt;p&gt;This is not:&lt;br&gt;
    • a scheduler&lt;br&gt;
    • a policy engine&lt;br&gt;
    • a fairness system&lt;br&gt;
    • a gateway&lt;br&gt;
    • a dashboard&lt;br&gt;
    • a distributed runtime&lt;/p&gt;

&lt;p&gt;It is a narrow primitive:&lt;/p&gt;

&lt;p&gt;Hard concurrency cap + explicit yield.&lt;/p&gt;

&lt;p&gt;Nothing more.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;A Small Tool, Intentionally&lt;/p&gt;

&lt;p&gt;I built a small ingress governor around this idea.&lt;/p&gt;

&lt;p&gt;It:&lt;br&gt;
    • accepts newline-delimited JSON frames over TCP&lt;br&gt;
    • validates upload integrity&lt;br&gt;
    • enforces max_inflight&lt;br&gt;
    • returns yield immediately when saturated&lt;br&gt;
    • exposes minimal metrics (inflight, executed_total, yielded_total)&lt;/p&gt;
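&lt;p&gt;This is not the heptamini code, but a governor of this shape can be sketched with asyncio streams. Assumptions in this sketch: one JSON frame per connection, a 250 ms retry hint, a cap of 2, and a 0.1 s placeholder standing in for real execution; integrity validation and metrics are omitted:&lt;/p&gt;

```python
import asyncio
import json

MAX_INFLIGHT = 2                        # hard cap on concurrent execution
_slots = asyncio.Semaphore(MAX_INFLIGHT)

async def execute(frame):
    await asyncio.sleep(0.1)            # placeholder for real inference work
    return {"status": "ok", "echo": frame.get("payload")}

async def handle_conn(reader, writer):
    line = await reader.readline()      # one newline-delimited JSON frame
    frame = json.loads(line)
    if _slots.locked():                 # saturated: yield immediately, never queue
        reply = {"status": "yield", "retry_hint_ms": 250}
    else:
        # No await between the locked() check and the acquire, so the
        # acquire cannot block within a single event loop.
        async with _slots:
            reply = await execute(frame)
    writer.write((json.dumps(reply) + "\n").encode())
    await writer.drain()
    writer.close()
    await writer.wait_closed()
```

&lt;p&gt;Served via asyncio.start_server(handle_conn, host, port), the connection either gets an execution slot or an immediate yield frame; nothing is ever parked inside the process.&lt;/p&gt;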

&lt;p&gt;It does not inspect prompts.&lt;br&gt;
It does not introspect models.&lt;br&gt;
It does not count tokens.&lt;br&gt;
It does not apply policy.&lt;/p&gt;

&lt;p&gt;It governs execution slots.&lt;/p&gt;

&lt;p&gt;That’s all.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why Not Just Use Nginx?&lt;/p&gt;

&lt;p&gt;Because rate limiting is not execution governance.&lt;/p&gt;

&lt;p&gt;You can limit requests per second and still allow an unbounded number of concurrent backend submissions.&lt;/p&gt;

&lt;p&gt;Bounded concurrency and explicit yield are different primitives.&lt;/p&gt;

&lt;p&gt;They can coexist.&lt;br&gt;
They solve different problems.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The Core Idea&lt;/p&gt;

&lt;p&gt;Stop treating overload as something to buffer.&lt;/p&gt;

&lt;p&gt;Treat it as something to expose.&lt;/p&gt;

&lt;p&gt;If capacity is full, say so.&lt;/p&gt;

&lt;p&gt;Return yield.&lt;/p&gt;

&lt;p&gt;Remain bounded.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;If you operate constrained compute systems and care about deterministic behavior under burst, this approach may be useful.&lt;/p&gt;

&lt;p&gt;Reference implementation:&lt;br&gt;
&lt;a href="https://github.com/newssourcecrawler/heptamini" rel="noopener noreferrer"&gt;https://github.com/newssourcecrawler/heptamini&lt;/a&gt;&lt;/p&gt;

</description>
      <category>infrastructure</category>
      <category>systems</category>
      <category>backend</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
