<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: tao zeng</title>
    <description>The latest articles on Forem by tao zeng (@tao_zeng_e1cc937be4a6286a).</description>
    <link>https://forem.com/tao_zeng_e1cc937be4a6286a</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3727370%2F315c6a3e-3788-4d42-8381-b559b0f34a82.png</url>
      <title>Forem: tao zeng</title>
      <link>https://forem.com/tao_zeng_e1cc937be4a6286a</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tao_zeng_e1cc937be4a6286a"/>
    <language>en</language>
    <item>
      <title>It's the AI Era — Why Are You Still Using nginx for Your LLM Gateway?</title>
      <dc:creator>tao zeng</dc:creator>
      <pubDate>Fri, 23 Jan 2026 02:48:05 +0000</pubDate>
      <link>https://forem.com/tao_zeng_e1cc937be4a6286a/its-the-ai-era-why-are-you-still-using-nginx-for-your-llm-gateway-2dka</link>
      <guid>https://forem.com/tao_zeng_e1cc937be4a6286a/its-the-ai-era-why-are-you-still-using-nginx-for-your-llm-gateway-2dka</guid>
      <description>&lt;p&gt;Introduction&lt;/p&gt;

&lt;p&gt;2024 was the year LLM applications exploded. Countless companies began integrating ChatGPT, Claude, Gemini, and other LLM services into their products. As a backend architect, you probably instinctively reached for your old friend nginx — configure an upstream, add a rate-limiting module, write a few lines of Lua script — done!&lt;/p&gt;

&lt;p&gt;But soon you'll discover things aren't that simple:&lt;/p&gt;

&lt;p&gt;•  Streaming responses lag: Users stare at "Thinking..." for 5 seconds before seeing the first character&lt;br&gt;
•  Token metering is inaccurate: SSE streams get truncated by nginx buffering, usage stats don't add up&lt;br&gt;
•  Rate limiting fails: Limiting by request count? A single streaming conversation can run for 30 seconds, and your concurrency explodes&lt;br&gt;
•  Costs spiral out of control: One backend goes down, traffic doesn't intelligently failover, retry errors burn through your quota&lt;/p&gt;

&lt;p&gt;This isn't because nginx is bad — it simply wasn't designed for the LLM era.&lt;/p&gt;

&lt;p&gt;nginx's "Original Sin": General-Purpose Proxy vs. Specialized Use Cases&lt;/p&gt;

&lt;p&gt;nginx was born in 2004, created to solve the C10K problem as a high-performance HTTP server. Its design philosophy is generality:&lt;/p&gt;

&lt;p&gt;•  Static file serving&lt;br&gt;
•  Reverse proxy and load balancing&lt;br&gt;
•  HTTP caching and compression&lt;br&gt;
•  SSL/TLS termination&lt;/p&gt;

&lt;p&gt;But LLM API gateway requirements are completely different:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Zero-Latency Streaming vs. Buffer Optimization&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;nginx buffers upstream responses by default (proxy_buffering on). Even if you disable buffering, its event loop and memory management aren't optimized for byte-by-byte passthrough, so the first token reaches the user noticeably late. LLMProxy was designed from day one with TTFT (Time To First Token) optimization as its first principle: a zero-buffer architecture flushes every SSE event to the client the moment it arrives, introducing no additional latency.&lt;/p&gt;
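&lt;p&gt;For reference, the nginx side of this fight usually looks something like the following sketch (not a drop-in config; the upstream name is illustrative):&lt;/p&gt;

```nginx
# Typical directives for SSE passthrough through nginx
location /v1/ {
    proxy_pass https://llm_upstream;
    proxy_http_version 1.1;
    proxy_set_header Connection "";

    proxy_buffering off;       # stop nginx from holding the stream
    proxy_cache off;
    proxy_read_timeout 300s;   # streaming completions can run for minutes
}
```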

&lt;ol start="2"&gt;
&lt;li&gt;Token-Level Metering vs. Request-Level Statistics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;nginx's unit of logging and statistics is the request, but LLM billing is based on tokens. To implement token statistics in nginx, you need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse the complete response body with lua-nginx-module&lt;/li&gt;
&lt;li&gt;Extract the usage field from JSON (in streaming responses, it's in the final SSE event)&lt;/li&gt;
&lt;li&gt;Write to a database or call a webhook&lt;/li&gt;
&lt;li&gt;Handle stream disconnections, parsing failures, race conditions...&lt;/li&gt;
&lt;/ol&gt;
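&lt;p&gt;A sketch of steps 1-3 with lua-nginx-module; report_usage() is a hypothetical helper standing in for your webhook or database write, and error handling is omitted:&lt;/p&gt;

```nginx
location /v1/chat/completions {
    proxy_pass https://llm_upstream;
    body_filter_by_lua_block {
        -- accumulate the whole stream: the usage object only arrives in
        -- the final SSE event, which conflicts with zero-buffer streaming
        ngx.ctx.buf = (ngx.ctx.buf or "") .. (ngx.arg[1] or "")
        if ngx.arg[2] then  -- eof
            local usage = ngx.ctx.buf:match('"usage"%s*:%s*(%b{})')
            if usage then
                -- step 3: hand off to a webhook/DB (hypothetical helper)
                report_usage(require("cjson").decode(usage))
            end
        end
    }
}
```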

&lt;p&gt;In LLMProxy, this logic is asynchronous and works out of the box: it supports database, webhook, and Prometheus reporting, and can integrate with authentication systems for quota deduction.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Intelligent Routing vs. Simple Round-Robin&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;nginx offers a handful of static load-balancing strategies (round-robin, least_conn, ip_hash). They look fine on paper, but in reality:&lt;br&gt;
•  api1 has 200ms latency, api2 has 50ms → nginx doesn't know, keeps round-robin&lt;br&gt;
•  api1 returns a 429 rate limit error → nginx marks it as failed, but the user's quota was already consumed&lt;br&gt;
•  Want to use Claude as a fallback for GPT → nginx's backup only kicks in when ALL primary nodes are down&lt;/p&gt;
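&lt;p&gt;The kind of setup those bullets describe (hostnames illustrative):&lt;/p&gt;

```nginx
upstream llm_backends {
    server api1.example.com:443;           # round-robin is blind to latency
    server api2.example.com:443;
    server backup.example.com:443 backup;  # used only when ALL primaries fail
}
```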

&lt;p&gt;LLMProxy's intelligent router works differently: every request dynamically selects the optimal backend based on real-time health, historical latency, and error rates, and failures are retried automatically without double-billing.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Modern Authentication vs. Hand-Written Lua Scripts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To implement "check Redis for API Key + verify quota + IP allowlist" authentication logic in nginx, you need:&lt;br&gt;
lua&lt;br&gt;
LLMProxy's authentication pipeline is declarative:&lt;br&gt;
yaml&lt;br&gt;
Each provider can have custom Lua scripts for fine-grained control, but infrastructure (connection pools, error recovery, performance metrics) is managed by the framework.&lt;/p&gt;

&lt;p&gt;Performance Comparison: Let the Data Speak&lt;/p&gt;

&lt;p&gt;We tested nginx (OpenResty) and LLMProxy on identical hardware in a typical LLM proxy scenario:&lt;br&gt;
| Metric                      | nginx + lua | LLMProxy    | Notes                         |&lt;br&gt;
| --------------------------- | ----------- | ----------- | ----------------------------- |&lt;br&gt;
| &lt;strong&gt;TTFT&lt;/strong&gt;                    | 120-300ms   | 5-15ms      | Time to first token (p50)     |&lt;br&gt;
| &lt;strong&gt;Streaming throughput&lt;/strong&gt;    | ~800 req/s  | ~2400 req/s | 100 concurrent, streaming     |&lt;br&gt;
| &lt;strong&gt;Memory usage&lt;/strong&gt;            | 1.2GB       | 380MB       | Steady state, 10k connections |&lt;br&gt;
| &lt;strong&gt;Token metering accuracy&lt;/strong&gt; | 93%*        | 99.99%      | *Requires custom Lua logic    |&lt;br&gt;
| &lt;strong&gt;Health check overhead&lt;/strong&gt;    | ±50ms/req   | &amp;lt;1ms/req    | Backend health probe impact   |&lt;br&gt;
Test environment: 4 cores, 8GB RAM, backend simulating 20ms latency streaming LLM API&lt;/p&gt;

&lt;p&gt;Why is it so much faster?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go's concurrency model: goroutines + channels are naturally suited for many long-lived connections&lt;/li&gt;
&lt;li&gt;Zero-copy streaming: Read directly from upstream socket → write to downstream, no intermediate buffering&lt;/li&gt;
&lt;li&gt;Dedicated metrics collection: Async goroutines handle metering without blocking the hot path&lt;/li&gt;
&lt;li&gt;Smart connection reuse: Maintains long-lived connection pools to backend LLM APIs, avoiding TLS handshake overhead&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Real-World Scenario: The Painful Story of an AI Customer Service System&lt;/p&gt;

&lt;p&gt;An online education company deployed an AI customer service bot with nginx. It crashed the day after launch:&lt;/p&gt;

&lt;p&gt;Problem 1: Poor streaming experience, users complaining about "lag"&lt;br&gt;
•  nginx config: proxy_buffering off; proxy_cache off;&lt;br&gt;
•  Root cause: even with buffering off, nginx relays upstream data in fixed buffer-sized reads (proxy_buffer_size, typically 4-8KB), so streaming responses get chunked&lt;br&gt;
•  LLMProxy solution: Zero-buffer engine + immediate flush, SSE events delivered to the client in &amp;lt;5ms&lt;/p&gt;

&lt;p&gt;Problem 2: Costs exploded, token usage 40% higher than expected&lt;br&gt;
•  nginx symptom: Couldn't accurately track per-session token consumption&lt;br&gt;
•  Investigation revealed: Retry logic had a bug, same request was billed multiple times; usage lost on stream disconnect&lt;br&gt;
•  LLMProxy solution: Async metering with at-least-once semantics, database + webhook dual-write prevents data loss&lt;/p&gt;

&lt;p&gt;Problem 3: Peak hour avalanche, entire site crashed after one API endpoint returned 429&lt;br&gt;
•  nginx behavior: Upstream 429 counted as "failure," triggered circuit breaker; backup node instantly got 10x the load&lt;br&gt;
•  LLMProxy solution: 429 triggers automatic fallback to backup model, with separate rate limiting per backend URL&lt;/p&gt;

&lt;p&gt;Final results:&lt;br&gt;
•  After switching to LLMProxy, P99 latency dropped from 8 seconds to 1.2 seconds&lt;br&gt;
•  Token metering accuracy improved from 85% to 99%+&lt;br&gt;
•  Monthly API costs decreased 30% (reduced invalid retries and quota waste)&lt;/p&gt;

&lt;p&gt;"But I Already Have nginx — Why Switch?"&lt;/p&gt;

&lt;p&gt;This is the most common question. The answer depends on your stage:&lt;/p&gt;

&lt;p&gt;If you're still in POC/MVP stage&lt;br&gt;
•  nginx is fine, don't over-engineer&lt;br&gt;
•  But leave room for upgrades (don't hardcode business logic into nginx.conf)&lt;/p&gt;

&lt;p&gt;If you already have real user traffic&lt;br&gt;
You've probably already encountered these pain points:&lt;br&gt;
•  ✅ Users complaining about slow, laggy responses&lt;br&gt;
•  ✅ Token metering doesn't match, finance can't reconcile the books&lt;br&gt;
•  ✅ Want A/B testing, model fallback, but nginx config is unmaintainable&lt;br&gt;
•  ✅ Monitoring only shows request count/bandwidth, not business metrics (tokens/cost/model distribution)&lt;/p&gt;

&lt;p&gt;At this point, LLMProxy lets you accomplish in half a day what nginx + Lua takes a week to implement.&lt;/p&gt;

&lt;p&gt;If you're building an enterprise LLM platform&lt;br&gt;
You don't need a proxy — you need an LLM gateway:&lt;br&gt;
•  Multi-tenant isolation and quota management&lt;br&gt;
•  Fine-grained cost accounting and billing&lt;br&gt;
•  Compliance auditing (logging all prompts and completions)&lt;br&gt;
•  Model routing and canary deployments&lt;br&gt;
•  Custom request/response transformations&lt;/p&gt;

&lt;p&gt;Can nginx do all this? Theoretically yes, but you'll need:&lt;br&gt;
•  OpenResty + extensive custom Lua modules&lt;br&gt;
•  External services (auth center, quota management, audit logging...)&lt;br&gt;
•  Your own complex configuration and operations toolchain&lt;/p&gt;

&lt;p&gt;LLMProxy provides a complete solution out of the box — you just write YAML config and business logic Lua scripts.&lt;/p&gt;

&lt;p&gt;Quick Start: Migrate from nginx to LLMProxy in 15 Minutes&lt;/p&gt;

&lt;p&gt;Production migration checklist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;[ ] Deploy LLMProxy alongside nginx, route 10% traffic for validation&lt;/li&gt;
&lt;li&gt;[ ] Set up monitoring (Prometheus + Grafana) to compare against nginx metrics&lt;/li&gt;
&lt;li&gt;[ ] Migrate authentication logic (API key validation, quota management)&lt;/li&gt;
&lt;li&gt;[ ] Configure intelligent routing and fallback rules&lt;/li&gt;
&lt;li&gt;[ ] Full cutover, keep nginx as backup&lt;/li&gt;
&lt;li&gt;[ ] One week observation period, then decommission nginx&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Conclusion: Use the Right Tool for the Job&lt;/p&gt;

&lt;p&gt;nginx is great software. It has proven its dominance in web serving over 20 years. But LLM API gateway is an entirely new domain that requires:&lt;/p&gt;

&lt;p&gt;•  Microsecond-level streaming optimization&lt;br&gt;
•  Async token-level metering and cost accounting&lt;br&gt;
•  Business-semantic intelligent routing&lt;br&gt;
•  Flexible multi-tenancy and quota management&lt;/p&gt;

&lt;p&gt;Implementing these requirements with a general-purpose HTTP proxy is like using a screwdriver to hammer nails — theoretically possible, practically painful.&lt;/p&gt;

&lt;p&gt;LLMProxy's mission is to make LLM API gateways simple again:&lt;br&gt;
•  Developers write YAML config, not Lua scripts&lt;br&gt;
•  Ops see business metrics (tokens/cost/models), not just QPS/latency&lt;br&gt;
•  Product managers do canary releases and A/B tests without waiting for engineering sprints&lt;/p&gt;

&lt;p&gt;Project: GitHub - LLMProxy&lt;br&gt;&lt;br&gt;
Documentation: Complete architecture design, configuration reference, best practices&lt;br&gt;&lt;br&gt;
Community: Issues and Discussions welcome for questions and contributions&lt;/p&gt;

&lt;p&gt;In the AI era, your gateway should be AI-native too. What's your choice?&lt;/p&gt;

</description>
      <category>aiops</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
