<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Karan Kumar</title>
    <description>The latest articles on Forem by Karan Kumar (@karan_kumar_f09865ff0efe9).</description>
    <link>https://forem.com/karan_kumar_f09865ff0efe9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875206%2F404a5575-852c-4acb-b569-c7343cb4d136.png</url>
      <title>Forem: Karan Kumar</title>
      <link>https://forem.com/karan_kumar_f09865ff0efe9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/karan_kumar_f09865ff0efe9"/>
    <language>en</language>
    <item>
      <title>Designing Agentic AI: From Simple Prompts to Autonomous Loops</title>
      <dc:creator>Karan Kumar</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:58:42 +0000</pubDate>
      <link>https://forem.com/karan_kumar_f09865ff0efe9/designing-agentic-ai-from-simple-prompts-to-autonomous-loops-54m2</link>
      <guid>https://forem.com/karan_kumar_f09865ff0efe9/designing-agentic-ai-from-simple-prompts-to-autonomous-loops-54m2</guid>
      <description>&lt;p&gt;Your LLM agent is stuck in an infinite loop. It’s calling the same API tool repeatedly, burning through your token budget, and providing zero value to the user. You try to fix it with a longer system prompt, but that only makes the agent more prone to hallucinating its own tool outputs. &lt;/p&gt;

&lt;p&gt;The reality is that prompt engineering is not a system design strategy. To build autonomous AI agents that actually scale, you need to move beyond the prompt and into architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge: The "Stochasticity Gap"
&lt;/h3&gt;

&lt;p&gt;Building a chatbot is easy; building an agent—a system that can reason, use tools, and correct its own mistakes—is a nightmare. The core problem is the &lt;strong&gt;Stochasticity Gap&lt;/strong&gt;: the distance between the probabilistic nature of an LLM and the deterministic requirements of software engineering.&lt;/p&gt;

&lt;p&gt;In a traditional system, calling &lt;code&gt;getUserData(id)&lt;/code&gt; returns a JSON object or a predictable error. In an agentic system, the LLM might decide to call &lt;code&gt;get_user_data&lt;/code&gt; (wrong casing), pass a string instead of an integer, or simply decide it doesn't need the data at all and invent a plausible-sounding answer.&lt;/p&gt;

&lt;p&gt;When you scale this to thousands of concurrent users, the edge cases explode. You aren't just managing API latency; you're managing "reasoning latency." If an agent requires five steps to solve a problem and each step has a 90% success rate, your overall success rate drops to ~59%. That is not production-ready.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture: The Cognitive Loop
&lt;/h3&gt;

&lt;p&gt;To solve this, we must move away from "one-shot" prompts and toward a state-machine architecture. Instead of treating the LLM as the program itself, treat it as the CPU within a larger system. The system provides the memory, the tools, and the guardrails.&lt;/p&gt;

&lt;p&gt;While many implement a ReAct (Reason + Act) pattern, the key to stability is wrapping it in a controlled execution loop. Rather than letting the LLM run wild, we implement a &lt;strong&gt;"Plan-Execute-Verify"&lt;/strong&gt; cycle.&lt;/p&gt;
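
&lt;p&gt;Here is a minimal sketch of that cycle in Python, assuming &lt;code&gt;plan&lt;/code&gt;, &lt;code&gt;execute&lt;/code&gt;, and &lt;code&gt;verify&lt;/code&gt; are hypothetical wrappers around your own LLM and tool calls; the point is the bounded, stateful control flow, not any specific framework:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal Plan-Execute-Verify loop (sketch). plan/execute/verify are
# hypothetical stand-ins for your own LLM and tool-calling functions.
MAX_ITERATIONS = 5  # hard cap prevents infinite loops


def run_agent(goal, plan, execute, verify):
    tasks = plan(goal)                    # LLM decomposes goal into tasks
    history = []
    for _ in range(MAX_ITERATIONS):
        for task in tasks:
            result = execute(task)        # tool call / API call
            history.append((task, result))
        verdict = verify(goal, history)   # separate critic pass
        if verdict["success"]:
            return verdict["answer"]
        tasks = plan(goal, feedback=verdict["gap"])  # re-plan on failure
    raise RuntimeError("Agent exceeded iteration budget")
&lt;/code&gt;&lt;/pre&gt;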

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBUQgogICAgVXNlcltVc2VyIFJlcXVlc3RdIC0tPiBPcmNoZXN0cmF0b3JbQWdlbnQgT3JjaGVzdHJhdG9yXQogICAgT3JjaGVzdHJhdG9yIC0tPiBQbGFubmVyW1BsYW5uZXI6IERlY29tcG9zZXMgR29hbCBpbnRvIFRhc2tzXQogICAgUGxhbm5lciAtLT4gRXhlY3V0b3JbRXhlY3V0b3I6IFRvb2wgVXNlIC8gQVBJIENhbGxzXQogICAgRXhlY3V0b3IgLS0-IFZlcmlmaWVyW1ZlcmlmaWVyOiBWYWxpZGF0ZXMgT3V0cHV0IGFnYWluc3QgR29hbF0KICAgIFZlcmlmaWVyIC0tICJGYWlsdXJlL0dhcCIgLS0-IFBsYW5uZXIKICAgIFZlcmlmaWVyIC0tICJTdWNjZXNzIiAtLT4gUmVzcG9uc2VbRmluYWwgQW5zd2VyIHRvIFVzZXJdCiAgICAKICAgIHN1YmdyYXBoIE1lbW9yeQogICAgICAgIFNob3J0VGVybVtXb3JraW5nIENvbnRleHQgLyBCdWZmZXJdCiAgICAgICAgTG9uZ1Rlcm1bVmVjdG9yIERCIC8gVXNlciBIaXN0b3J5XQogICAgZW5kCiAgICAKICAgIE9yY2hlc3RyYXRvciA8LS0-IE1lbW9yeQogICAgRXhlY3V0b3IgPC0tPiBNZW1vcnk%3D%3FbgColor%3D%21white" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBUQgogICAgVXNlcltVc2VyIFJlcXVlc3RdIC0tPiBPcmNoZXN0cmF0b3JbQWdlbnQgT3JjaGVzdHJhdG9yXQogICAgT3JjaGVzdHJhdG9yIC0tPiBQbGFubmVyW1BsYW5uZXI6IERlY29tcG9zZXMgR29hbCBpbnRvIFRhc2tzXQogICAgUGxhbm5lciAtLT4gRXhlY3V0b3JbRXhlY3V0b3I6IFRvb2wgVXNlIC8gQVBJIENhbGxzXQogICAgRXhlY3V0b3IgLS0-IFZlcmlmaWVyW1ZlcmlmaWVyOiBWYWxpZGF0ZXMgT3V0cHV0IGFnYWluc3QgR29hbF0KICAgIFZlcmlmaWVyIC0tICJGYWlsdXJlL0dhcCIgLS0-IFBsYW5uZXIKICAgIFZlcmlmaWVyIC0tICJTdWNjZXNzIiAtLT4gUmVzcG9uc2VbRmluYWwgQW5zd2VyIHRvIFVzZXJdCiAgICAKICAgIHN1YmdyYXBoIE1lbW9yeQogICAgICAgIFNob3J0VGVybVtXb3JraW5nIENvbnRleHQgLyBCdWZmZXJdCiAgICAgICAgTG9uZ1Rlcm1bVmVjdG9yIERCIC8gVXNlciBIaXN0b3J5XQogICAgZW5kCiAgICAKICAgIE9yY2hlc3RyYXRvciA8LS0-IE1lbW9yeQogICAgRXhlY3V0b3IgPC0tPiBNZW1vcnk%3D%3FbgColor%3D%21white" alt="architecture diagram" width="642" height="812"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Components: The Agentic Stack
&lt;/h3&gt;

&lt;p&gt;A robust architecture requires more than just an API key; it requires four distinct modules working in concert.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Planner (The Prefrontal Cortex)&lt;/strong&gt;&lt;br&gt;
The planner doesn't execute; it strategizes. It takes a complex query (e.g., &lt;em&gt;"Research the last three quarters of Nvidia's earnings and compare them to AMD"&lt;/em&gt;) and breaks it into a Directed Acyclic Graph (DAG) of tasks. This prevents the agent from getting lost in the weeds of a single API call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Tool Registry (The Hands)&lt;/strong&gt;&lt;br&gt;
Providing an LLM with every available tool creates noise and confusion. Instead, use a dynamic tool registry. Based on the user's intent, the orchestrator injects only the relevant tool definitions into the context window, reducing noise and saving tokens.&lt;/p&gt;
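
&lt;p&gt;A sketch of what that injection might look like, using the OpenAI function-calling schema shape; the intent keys and the &lt;code&gt;classify_intent&lt;/code&gt; helper are illustrative assumptions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Dynamic tool registry (sketch): inject only the tools relevant to the
# detected intent. Intent keys and classify_intent are illustrative.
TOOL_REGISTRY = {
    "billing": [{
        "name": "get_invoice",
        "description": "Fetch an invoice by id",
        "parameters": {"type": "object",
                       "properties": {"invoice_id": {"type": "string"}}},
    }],
    "weather": [{
        "name": "get_forecast",
        "description": "Forecast for a city",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}}},
    }],
}


def tools_for(query, classify_intent):
    intent = classify_intent(query)       # e.g., a small, fast model
    return TOOL_REGISTRY.get(intent, [])  # only the relevant subset
&lt;/code&gt;&lt;/pre&gt;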

&lt;p&gt;&lt;strong&gt;3. The Verifier (The Critic)&lt;/strong&gt;&lt;br&gt;
This is the most overlooked component. The Verifier is a separate, often smaller or more specialized LLM instance (or a set of deterministic rules) that asks: &lt;em&gt;"Does this output actually answer the user's request?"&lt;/em&gt; If the answer is no, it triggers a loop back to the planner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Memory Management (The Hippocampus)&lt;/strong&gt;&lt;br&gt;
Memory should be split into two tiers. Short-term memory acts as the sliding window of the current conversation. Long-term memory utilizes a Vector Database (such as Pinecone or Milvus) to retrieve relevant documents via RAG (Retrieval-Augmented Generation).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpzZXF1ZW5jZURpYWdyYW0KICAgIHBhcnRpY2lwYW50IFUgYXMgVXNlcgogICAgcGFydGljaXBhbnQgTyBhcyBPcmNoZXN0cmF0b3IKICAgIHBhcnRpY2lwYW50IFAgYXMgUGxhbm5lcgogICAgcGFydGljaXBhbnQgVCBhcyBUb29sL0FQSQogICAgcGFydGljaXBhbnQgViBhcyBWZXJpZmllcgoKICAgIFUtPj5POiAiQW5hbHl6ZSBteSBzcGVuZCBmb3IgUTMiCiAgICBPLT4-UDogR2VuZXJhdGUgVGFzayBMaXN0CiAgICBQLT4-TzogWzEuIEZldGNoIFEzIERhdGEsIDIuIFN1bW1hcml6ZSwgMy4gQ29tcGFyZV0KICAgIE8tPj5UOiBDYWxsIGdldF9zcGVuZChxdWFydGVyPSdRMycpCiAgICBULS0-Pk86IFJldHVybnMgcmF3IEpTT04KICAgIE8tPj5WOiBEb2VzIHRoaXMgSlNPTiBjb250YWluIFEzIHNwZW5kPwogICAgVi0tPj5POiBZZXMKICAgIE8tPj5QOiBOZXh0IHRhc2s6IFN1bW1hcml6ZQogICAgUC0-Pk86IFByb2Nlc3MgZGF0YSBpbnRvIGluc2lnaHRzCiAgICBPLT4-VTogIllvdXIgUTMgc3BlbmQgd2FzLi4uIg%3D%3D%3FbgColor%3D%21white" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpzZXF1ZW5jZURpYWdyYW0KICAgIHBhcnRpY2lwYW50IFUgYXMgVXNlcgogICAgcGFydGljaXBhbnQgTyBhcyBPcmNoZXN0cmF0b3IKICAgIHBhcnRpY2lwYW50IFAgYXMgUGxhbm5lcgogICAgcGFydGljaXBhbnQgVCBhcyBUb29sL0FQSQogICAgcGFydGljaXBhbnQgViBhcyBWZXJpZmllcgoKICAgIFUtPj5POiAiQW5hbHl6ZSBteSBzcGVuZCBmb3IgUTMiCiAgICBPLT4-UDogR2VuZXJhdGUgVGFzayBMaXN0CiAgICBQLT4-TzogWzEuIEZldGNoIFEzIERhdGEsIDIuIFN1bW1hcml6ZSwgMy4gQ29tcGFyZV0KICAgIE8tPj5UOiBDYWxsIGdldF9zcGVuZChxdWFydGVyPSdRMycpCiAgICBULS0-Pk86IFJldHVybnMgcmF3IEpTT04KICAgIE8tPj5WOiBEb2VzIHRoaXMgSlNPTiBjb250YWluIFEzIHNwZW5kPwogICAgVi0tPj5POiBZZXMKICAgIE8tPj5QOiBOZXh0IHRhc2s6IFN1bW1hcml6ZQogICAgUC0-Pk86IFByb2Nlc3MgZGF0YSBpbnRvIGluc2lnaHRzCiAgICBPLT4-VTogIllvdXIgUTMgc3BlbmQgd2FzLi4uIg%3D%3D%3FbgColor%3D%21white" alt="sequence diagram" width="1267" height="623"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data &amp;amp; Workflow: Handling the "Hallucination Loop"
&lt;/h3&gt;

&lt;p&gt;Data flow in an agentic system is non-linear. The primary risk is the &lt;strong&gt;"Hallucination Loop,"&lt;/strong&gt; where the agent makes a mistake, attempts to fix it by hallucinating a tool output, and then validates that hallucination as true.&lt;/p&gt;

&lt;p&gt;To prevent this, implement &lt;strong&gt;Strict Schema Enforcement&lt;/strong&gt;. Rather than asking the LLM for JSON, force it using constrained sampling (such as Guidance or Outlines). If the LLM outputs &lt;code&gt;{ "amount": "ten dollars" }&lt;/code&gt; when the schema requires an integer, the system rejects the output at the token level before it ever reaches the executor.&lt;/p&gt;
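
&lt;p&gt;Libraries like Guidance and Outlines enforce this during decoding. As a simpler approximation of the same idea, here is a sketch that validates the output against a Pydantic schema after generation and rejects type mismatches like the one above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Schema enforcement sketch with Pydantic. True constrained sampling
# rejects bad tokens during generation; validating afterwards, as here,
# is a simpler approximation of the same guarantee.
from pydantic import BaseModel, ValidationError


class Payment(BaseModel):
    amount: int          # schema requires an integer
    currency: str


def parse_tool_args(raw_json):
    try:
        return Payment.model_validate_json(raw_json)
    except ValidationError as err:
        return {"error": str(err)}  # reject; feed the error back for a retry


parse_tool_args('{"amount": "ten dollars", "currency": "USD"}')  # rejected
&lt;/code&gt;&lt;/pre&gt;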

&lt;p&gt;Furthermore, implement a &lt;strong&gt;Human-in-the-Loop (HITL)&lt;/strong&gt; trigger. For high-stakes actions—such as deleting a database or sending a payment—the state machine pauses and emits a &lt;code&gt;PENDING_APPROVAL&lt;/code&gt; event. The agent cannot proceed until a human signs off via a webhook.&lt;/p&gt;
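
&lt;p&gt;A sketch of that gate; the tool names, state values, and &lt;code&gt;call_tool&lt;/code&gt; executor are illustrative, not a specific framework's API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Human-in-the-Loop gate (sketch). HIGH_STAKES_TOOLS, the state values,
# and call_tool are illustrative.
HIGH_STAKES_TOOLS = {"delete_database", "send_payment"}


def dispatch(tool_name, args, state_store, job_id, call_tool):
    if tool_name in HIGH_STAKES_TOOLS:
        state_store.set(job_id, "PENDING_APPROVAL")  # pause the loop
        return {"status": "awaiting_human_signoff"}  # webhook resumes it
    return call_tool(tool_name, args)                # normal path
&lt;/code&gt;&lt;/pre&gt;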

&lt;h3&gt;
  
  
  Trade-offs &amp;amp; Scalability: Latency vs. Reliability
&lt;/h3&gt;

&lt;p&gt;Agentic systems are inherently slower. Each "loop" adds seconds to the response time; if an agent loops four times, the user may stare at a loading spinner for 20 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Throughput Bottleneck&lt;/strong&gt;&lt;br&gt;
The bottleneck is rarely the database—it is the LLM's Time-To-First-Token (TTFT). To scale, use a tiered model strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fast Path:&lt;/strong&gt; A small model (e.g., GPT-4o-mini or Claude Haiku) handles the Verifier and simple tool routing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow Path:&lt;/strong&gt; A large model (e.g., GPT-4o or Claude 3.5 Sonnet) handles complex Planning and Final Synthesis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;State Management&lt;/strong&gt;&lt;br&gt;
Agent state cannot be stored in a local variable; if a pod restarts, the agent loses its place in the plan. Use a distributed state store (like Redis) to track the "Agent State Object," which includes the current task index, the history of tool calls, and pending goals.&lt;/p&gt;
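
&lt;p&gt;A minimal sketch of that pattern with &lt;code&gt;redis-py&lt;/code&gt;; the field names in the state object are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Persisted Agent State Object (sketch) so a restarted worker can
# resume mid-plan. Field names are illustrative.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def save_state(job_id, state):
    r.set(f"agent:{job_id}", json.dumps(state))


def load_state(job_id):
    raw = r.get(f"agent:{job_id}")
    return json.loads(raw) if raw else None


save_state("job-42", {
    "current_task_index": 2,
    "tool_call_history": [{"tool": "get_spend", "args": {"quarter": "Q3"}}],
    "pending_goals": ["summarize", "compare"],
})
&lt;/code&gt;&lt;/pre&gt;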

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBURAogICAgQVtSZXF1ZXN0XSAtLT4gQntDb21wbGV4aXR5P30KICAgIEIgLS0gTG93IC0tPiBDW0RpcmVjdCBMTE0gUmVzcG9uc2VdCiAgICBCIC0tIEhpZ2ggLS0-IERbQWdlbnRpYyBMb29wXQogICAgRCAtLT4gRVtTdGF0ZSBTdG9yZTogUmVkaXNdCiAgICBFIC0tPiBGW0FzeW5jIFdvcmtlcjogQ2VsZXJ5L1RlbXBvcmFsXQogICAgRiAtLT4gR1tMTE0gUGxhbm5pbmddCiAgICBHIC0tPiBIW1Rvb2wgRXhlY3V0aW9uXQogICAgSCAtLT4gSVtWZXJpZmljYXRpb25dCiAgICBJIC0tIEZhaWwgLS0-IEcKICAgIEkgLS0gUGFzcyAtLT4gSltGaW5hbCBSZXNwb25zZV0%3D%3FbgColor%3D%21white" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBURAogICAgQVtSZXF1ZXN0XSAtLT4gQntDb21wbGV4aXR5P30KICAgIEIgLS0gTG93IC0tPiBDW0RpcmVjdCBMTE0gUmVzcG9uc2VdCiAgICBCIC0tIEhpZ2ggLS0-IERbQWdlbnRpYyBMb29wXQogICAgRCAtLT4gRVtTdGF0ZSBTdG9yZTogUmVkaXNdCiAgICBFIC0tPiBGW0FzeW5jIFdvcmtlcjogQ2VsZXJ5L1RlbXBvcmFsXQogICAgRiAtLT4gR1tMTE0gUGxhbm5pbmddCiAgICBHIC0tPiBIW1Rvb2wgRXhlY3V0aW9uXQogICAgSCAtLT4gSVtWZXJpZmljYXRpb25dCiAgICBJIC0tIEZhaWwgLS0-IEcKICAgIEkgLS0gUGFzcyAtLT4gSltGaW5hbCBSZXNwb25zZV0%3D%3FbgColor%3D%21white" alt="architecture diagram" width="473" height="1057"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stop Prompting, Start Architecting:&lt;/strong&gt; If your logic depends on a "better prompt," your system is fragile. Move the logic into the orchestrator and verifier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constraint &amp;gt; Instruction:&lt;/strong&gt; Use constrained sampling to force JSON schemas rather than asking the model to "please output JSON."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier Your Models:&lt;/strong&gt; Use small, fast models for verification and routing; reserve expensive, slow models for high-level reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State is Everything:&lt;/strong&gt; Treat agentic workflows as long-running distributed transactions. Use a state store (like Redis or Temporal) to ensure resilience across restarts.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How Epic Games Scales to 100M+ Concurrent Users</title>
      <dc:creator>Karan Kumar</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:31:31 +0000</pubDate>
      <link>https://forem.com/karan_kumar_f09865ff0efe9/how-epic-games-scales-to-100m-concurrent-users-1k0j</link>
      <guid>https://forem.com/karan_kumar_f09865ff0efe9/how-epic-games-scales-to-100m-concurrent-users-1k0j</guid>
      <description>&lt;p&gt;Your game just launched. A million players flood the servers in ten minutes. Suddenly, your matchmaking service spikes to 100% CPU, the database locks up, and the entire world freezes. This isn't a hypothetical—it's the nightmare scenario for any studio launching a global hit.&lt;/p&gt;

&lt;p&gt;Scaling a game like &lt;em&gt;Fortnite&lt;/em&gt; isn't as simple as adding more servers. It requires managing the state of millions of entities in real-time while ensuring that a player in Tokyo and a player in New York feel like they're in the same room. To achieve this, Epic Games utilizes a hyper-optimized blend of event-driven architecture, distributed state management, and aggressive caching.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge: The "World State" Problem
&lt;/h3&gt;

&lt;p&gt;In a standard CRUD application, if a user updates their profile, you write to a database and the next request reads it. Simple. In a massive multiplayer environment, however, "state" is everything: Where is every player? Who is shooting whom? Which building just collapsed?&lt;/p&gt;

&lt;p&gt;At this scale, developers hit three primary walls:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Latency Wall:&lt;/strong&gt; Light travels at a finite speed. You cannot rely on a single global database for a fast-paced shooter; if you do, the game will feel like it's playing through molasses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The State Explosion:&lt;/strong&gt; Every single movement is an update. If 100 players move 60 times per second, that's 6,000 updates per second per match. Multiply that by thousands of concurrent matches, and your database becomes an instant bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Synchronization Nightmare:&lt;/strong&gt; How do you ensure all players perceive the same event at roughly the same time without crashing the network?&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Architecture: A Hybrid Distributed Model
&lt;/h3&gt;

&lt;p&gt;Epic doesn't rely on a single monolithic cluster. Instead, they decouple the &lt;strong&gt;Game World&lt;/strong&gt; (real-time physics and combat) from the &lt;strong&gt;Player Meta-state&lt;/strong&gt; (skins, levels, and friend lists).&lt;/p&gt;

&lt;p&gt;The Game World resides on regional dedicated servers (DS) to minimize latency. The Meta-state lives in a globally distributed microservices layer. When you enter a match, the DS "checks out" your state from the global service, manages it locally for the duration of the game, and "commits" the changes back once the match ends.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBUQgogICAgVXNlcltQbGF5ZXIgQ2xpZW50XSAtLT4gTEJbR2xvYmFsIExvYWQgQmFsYW5jZXJdCiAgICBMQiAtLT4gTWF0Y2htYWtpbmdbTWF0Y2htYWtpbmcgU2VydmljZV0KICAgIE1hdGNobWFraW5nIC0tPiBSZWdpb25hbERTW1JlZ2lvbmFsIERlZGljYXRlZCBTZXJ2ZXJdCiAgICBSZWdpb25hbERTIC0tPiBTdGF0ZUNhY2hlW0xvY2FsIFJlZGlzIENhY2hlXQogICAgU3RhdGVDYWNoZSAtLT4gR2xvYmFsREJbKERpc3RyaWJ1dGVkIEdsb2JhbCBEQildCiAgICBSZWdpb25hbERTIC0tPiBFdmVudEJ1c1tFdmVudC1Ecml2ZW4gQnVzL0thZmthXQogICAgRXZlbnRCdXMgLS0-IEFuYWx5dGljc1tBbmFseXRpY3MgJiBMb2dnaW5nXQogICAgRXZlbnRCdXMgLS0-IFJld2FyZHNbUmV3YXJkcy9YUCBTZXJ2aWNlXQ%3D%3D%3FbgColor%3D%21white" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBUQgogICAgVXNlcltQbGF5ZXIgQ2xpZW50XSAtLT4gTEJbR2xvYmFsIExvYWQgQmFsYW5jZXJdCiAgICBMQiAtLT4gTWF0Y2htYWtpbmdbTWF0Y2htYWtpbmcgU2VydmljZV0KICAgIE1hdGNobWFraW5nIC0tPiBSZWdpb25hbERTW1JlZ2lvbmFsIERlZGljYXRlZCBTZXJ2ZXJdCiAgICBSZWdpb25hbERTIC0tPiBTdGF0ZUNhY2hlW0xvY2FsIFJlZGlzIENhY2hlXQogICAgU3RhdGVDYWNoZSAtLT4gR2xvYmFsREJbKERpc3RyaWJ1dGVkIEdsb2JhbCBEQildCiAgICBSZWdpb25hbERTIC0tPiBFdmVudEJ1c1tFdmVudC1Ecml2ZW4gQnVzL0thZmthXQogICAgRXZlbnRCdXMgLS0-IEFuYWx5dGljc1tBbmFseXRpY3MgJiBMb2dnaW5nXQogICAgRXZlbnRCdXMgLS0-IFJld2FyZHNbUmV3YXJkcy9YUCBTZXJ2aWNlXQ%3D%3D%3FbgColor%3D%21white" alt="architecture diagram" width="686" height="617"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Components: The Engine Room
&lt;/h3&gt;

&lt;p&gt;To prevent the system from collapsing under its own weight, Epic employs several critical architectural patterns.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. The Matchmaking Orchestrator
&lt;/h4&gt;

&lt;p&gt;Matchmaking is a classic "bin-packing" problem: you must group players by skill, latency, and platform. Rather than using synchronous requests, Epic uses an asynchronous queue. Players enter a pool, a worker evaluates the best fit, and the system then spins up a dedicated server instance specifically for that group.&lt;/p&gt;
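
&lt;p&gt;A toy sketch of that flow; real matchmakers solve a much harder bin-packing problem, and the pool size, skill-spread threshold, and &lt;code&gt;spin_up_server&lt;/code&gt; call here are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Asynchronous matchmaking (toy sketch): players enter a pool, a worker
# pulls compatible groups and requests a dedicated server per match.
import queue

pool = queue.Queue()
MATCH_SIZE = 100
MAX_SKILL_SPREAD = 200


def matchmaking_worker(spin_up_server):
    waiting = []
    while True:
        waiting.append(pool.get())        # blocks until a player joins
        waiting.sort(key=lambda p: p["skill"])
        if len(waiting) &amp;gt;= MATCH_SIZE:
            group = waiting[:MATCH_SIZE]
            if group[-1]["skill"] - group[0]["skill"] &amp;lt;= MAX_SKILL_SPREAD:
                del waiting[:MATCH_SIZE]
                spin_up_server(group)     # dedicated server for this group
&lt;/code&gt;&lt;/pre&gt;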

&lt;h4&gt;
  
  
  2. Distributed Caching (The Speed Layer)
&lt;/h4&gt;

&lt;p&gt;Direct database hits are forbidden in the "hot path." Every player attribute is cached in a distributed layer (such as Redis). If a player changes their skin, the update hits the cache first, which then asynchronously updates the persistent store. This is "eventual consistency" in action—it doesn't matter if the database is 200ms behind, as long as the player sees their new skin immediately.&lt;/p&gt;
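
&lt;p&gt;A sketch of that cache-first write path with &lt;code&gt;redis-py&lt;/code&gt; and a background queue; the persistence worker and key layout are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Write-behind caching (sketch): the cache is updated synchronously,
# the database catches up asynchronously. Key names are illustrative.
import queue
import redis

r = redis.Redis(decode_responses=True)
write_queue = queue.Queue()


def update_skin(player_id, skin_id):
    r.hset(f"player:{player_id}", "skin", skin_id)  # player sees it now
    write_queue.put((player_id, skin_id))           # DB catches up later


def persist_worker(db_write):
    while True:
        player_id, skin_id = write_queue.get()
        db_write(player_id, skin_id)  # may lag; eventual consistency
&lt;/code&gt;&lt;/pre&gt;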

&lt;h4&gt;
  
  
  3. Event-Driven Backbone
&lt;/h4&gt;

&lt;p&gt;Not every action requires real-time processing. For example, gaining 50 XP doesn't need to be handled by the game server's main loop. Instead, the server emits an event to a message bus (like Kafka). A separate "Rewards Service" consumes that event and updates the player's level, removing processing overhead from the critical game loop.&lt;/p&gt;
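
&lt;p&gt;A sketch of that fire-and-forget emission using &lt;code&gt;kafka-python&lt;/code&gt;; the topic name and payload shape are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Off-loop event emission (sketch) with kafka-python. Topic and payload
# shape are illustrative.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def on_enemy_killed(player_id, xp):
    # Fire-and-forget from the game loop; the Rewards Service consumes it
    producer.send("player-events",
                  {"type": "EnemyKilled", "player_id": player_id, "xp": xp})
&lt;/code&gt;&lt;/pre&gt;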

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpzZXF1ZW5jZURpYWdyYW0KICAgIHBhcnRpY2lwYW50IFAgYXMgUGxheWVyCiAgICBwYXJ0aWNpcGFudCBEUyBhcyBHYW1lIFNlcnZlcgogICAgcGFydGljaXBhbnQgRUIgYXMgRXZlbnQgQnVzCiAgICBwYXJ0aWNpcGFudCBSUyBhcyBSZXdhcmRzIFNlcnZpY2UKICAgIHBhcnRpY2lwYW50IERCIGFzIEdsb2JhbCBEQgoKICAgIFAtPj5EUzogUGVyZm9ybXMgQWN0aW9uIChLaWxsIEVuZW15KQogICAgRFMtPj5EUzogQ2FsY3VsYXRlIFBoeXNpY3MvRGFtYWdlCiAgICBEUy0-PlA6IENvbmZpcm0gSGl0IChMb3cgTGF0ZW5jeSkKICAgIERTLT4-RUI6IEVtaXQgIkVuZW15S2lsbGVkIiBFdmVudAogICAgRUItPj5SUzogVHJpZ2dlciBYUCBDYWxjdWxhdGlvbgogICAgUlMtPj5EQjogVXBkYXRlIFBsYXllciBMZXZlbA%3D%3D%3FbgColor%3D%21white" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpzZXF1ZW5jZURpYWdyYW0KICAgIHBhcnRpY2lwYW50IFAgYXMgUGxheWVyCiAgICBwYXJ0aWNpcGFudCBEUyBhcyBHYW1lIFNlcnZlcgogICAgcGFydGljaXBhbnQgRUIgYXMgRXZlbnQgQnVzCiAgICBwYXJ0aWNpcGFudCBSUyBhcyBSZXdhcmRzIFNlcnZpY2UKICAgIHBhcnRpY2lwYW50IERCIGFzIEdsb2JhbCBEQgoKICAgIFAtPj5EUzogUGVyZm9ybXMgQWN0aW9uIChLaWxsIEVuZW15KQogICAgRFMtPj5EUzogQ2FsY3VsYXRlIFBoeXNpY3MvRGFtYWdlCiAgICBEUy0-PlA6IENvbmZpcm0gSGl0IChMb3cgTGF0ZW5jeSkKICAgIERTLT4-RUI6IEVtaXQgIkVuZW15S2lsbGVkIiBFdmVudAogICAgRUItPj5SUzogVHJpZ2dlciBYUCBDYWxjdWxhdGlvbgogICAgUlMtPj5EQjogVXBkYXRlIFBsYXllciBMZXZlbA%3D%3D%3FbgColor%3D%21white" alt="sequence diagram" width="1180" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Workflow: From Client to Cloud
&lt;/h3&gt;

&lt;p&gt;Data flows through two distinct lanes: the &lt;strong&gt;Fast Lane&lt;/strong&gt; and the &lt;strong&gt;Reliable Lane&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fast Lane (UDP/Custom Protocols):&lt;/strong&gt;&lt;br&gt;
Player movement and combat utilize UDP. In this context, we don't care if a single packet is lost; we only care about the &lt;em&gt;most recent&lt;/em&gt; position. If packet #40 is missing, the system doesn't request a retransmission—it simply waits for packet #41. This prevents the "head-of-line blocking" that would otherwise cripple TCP-based games.&lt;/p&gt;
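
&lt;p&gt;On the receive side, "latest packet wins" can be as simple as a per-player sequence number; this sketch assumes a hypothetical &lt;code&gt;apply_position&lt;/code&gt; hook:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# "Latest packet wins" (sketch): stamp each UDP packet with a sequence
# number and drop anything older than what was already applied.
last_seq = {}


def on_position_packet(player_id, seq, position, apply_position):
    if seq &amp;lt;= last_seq.get(player_id, -1):
        return                   # stale or duplicate: drop, never retransmit
    last_seq[player_id] = seq
    apply_position(player_id, position)
&lt;/code&gt;&lt;/pre&gt;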

&lt;p&gt;&lt;strong&gt;The Reliable Lane (HTTPS/gRPC):&lt;/strong&gt;&lt;br&gt;
Buying a skin or joining a party utilizes TCP/HTTPS. These transactions must be atomic; you cannot "lose a packet" when a user is spending real money. These requests hit the API Gateway, are authenticated, and are routed to the specific microservice responsible for that domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trade-offs and Scalability
&lt;/h3&gt;

&lt;p&gt;No system is perfect. Epic makes specific trade-offs to achieve this level of scale:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency vs. Availability (CAP Theorem)&lt;/strong&gt;&lt;br&gt;
Epic prioritizes Availability and Partition Tolerance over strict Consistency. If the global database is slightly out of sync for a few seconds, the game continues to run. This is why you occasionally see a "syncing" spinner when opening your locker—the system is reconciling the local cache with the global source of truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute: Static vs. Dynamic Scaling&lt;/strong&gt;&lt;br&gt;
Dedicated servers are compute-heavy and take time to scale. To solve this, Epic uses "warm pools"—pre-provisioned server instances that sit idle, ready to accept a match instantly. This trades higher cloud costs (paying for idle servers) for a superior user experience (zero wait time).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network Bottlenecks&lt;/strong&gt;&lt;br&gt;
As the number of players in a match grows, the required bandwidth grows quadratically (

&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;O(n2)O(n^2)&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;O&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
) because every player needs to know the location of every other player. To mitigate this, they use &lt;strong&gt;Interest Management&lt;/strong&gt;. The server only sends updates about entities within a certain radius of the player. If a fight is happening 2km away, your client doesn't need the exact coordinates of every bullet—only that "something is happening" in that direction.&lt;/p&gt;
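
&lt;p&gt;A minimal sketch of radius-based filtering; the radius value is illustrative, and the squared-distance comparison avoids a square root per entity:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Interest Management (sketch): only entities inside the player's
# interest radius are replicated at high frequency.
INTEREST_RADIUS = 250.0  # illustrative


def entities_to_replicate(player_pos, entities):
    px, py = player_pos
    r2 = INTEREST_RADIUS ** 2
    for e in entities:
        dx, dy = e["x"] - px, e["y"] - py
        if dx * dx + dy * dy &amp;lt;= r2:
            yield e  # high-frequency updates; others get coarse ones
&lt;/code&gt;&lt;/pre&gt;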

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBURAogICAgQVtBbGwgR2FtZSBFbnRpdGllc10gLS0-IEJ7SW4gUGxheWVyIFJhZGl1cz99CiAgICBCIC0tIFllcyAtLT4gQ1tTZW5kIEhpZ2gtRnJlcXVlbmN5IFVwZGF0ZXNdCiAgICBCIC0tIE5vIC0tPiBEW1NlbmQgTG93LUZyZXF1ZW5jeS9ObyBVcGRhdGVzXQogICAgQyAtLT4gRVtDbGllbnQgUmVuZGVycyBTbW9vdGhseV0KICAgIEQgLS0-IEZbQ2xpZW50IElnbm9yZXMvSW50ZXJwb2xhdGVzXQ%3D%3D%3FbgColor%3D%21white" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBURAogICAgQVtBbGwgR2FtZSBFbnRpdGllc10gLS0-IEJ7SW4gUGxheWVyIFJhZGl1cz99CiAgICBCIC0tIFllcyAtLT4gQ1tTZW5kIEhpZ2gtRnJlcXVlbmN5IFVwZGF0ZXNdCiAgICBCIC0tIE5vIC0tPiBEW1NlbmQgTG93LUZyZXF1ZW5jeS9ObyBVcGRhdGVzXQogICAgQyAtLT4gRVtDbGllbnQgUmVuZGVycyBTbW9vdGhseV0KICAgIEQgLS0-IEZbQ2xpZW50IElnbm9yZXMvSW50ZXJwb2xhdGVzXQ%3D%3D%3FbgColor%3D%21white" alt="architecture diagram" width="583" height="544"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Decouple Real-time from Meta-state:&lt;/strong&gt; Keep your physics loop separate from your database updates. Use regional servers for speed and global services for persistence.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Embrace Eventual Consistency:&lt;/strong&gt; Use a message bus for non-critical updates (XP, achievements, logs) to keep the main execution thread lean.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;UDP for Speed, TCP for Truth:&lt;/strong&gt; Use the right protocol for the right job. Don't let a lost movement packet stall your entire network stream.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Interest Management is Mandatory:&lt;/strong&gt; Don't broadcast the entire world state to every client. Filter data based on what the user actually needs to see.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Warm Pools &amp;gt; Cold Starts:&lt;/strong&gt; In high-scale gaming, the cost of idle compute is lower than the cost of a player leaving because the match took too long to load.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributedsystems</category>
      <category>gamedev</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Scaling Vector Databases: How to Handle Billions of Embeddings</title>
      <dc:creator>Karan Kumar</dc:creator>
      <pubDate>Mon, 13 Apr 2026 15:43:57 +0000</pubDate>
      <link>https://forem.com/karan_kumar_f09865ff0efe9/scaling-vector-databases-how-to-handle-billions-of-embeddings-393d</link>
      <guid>https://forem.com/karan_kumar_f09865ff0efe9/scaling-vector-databases-how-to-handle-billions-of-embeddings-393d</guid>
      <description>&lt;p&gt;Your RAG application works perfectly with 1,000 documents. You push it to production, upload 10 million vectors, and suddenly your query latency jumps from 50ms to 5 seconds. You try adding more RAM, but the index doesn't fit in memory, and your system crashes under the pressure of a simple k-NN search. &lt;/p&gt;

&lt;p&gt;Why do traditional databases fail at this scale? And more importantly, how do you build a vector engine that doesn't?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge: The Curse of Dimensionality
&lt;/h3&gt;

&lt;p&gt;Searching for a string in a B-Tree index is straightforward: you follow a path, find the leaf, and you're done. Vector search is a different beast entirely. We aren't looking for an exact match; we're searching for the "nearest neighbor" in a high-dimensional space (often 768 or 1536 dimensions).&lt;/p&gt;

&lt;p&gt;If you perform a brute-force linear scan (a &lt;strong&gt;Flat index&lt;/strong&gt;), you must calculate the distance between your query vector and every single vector in your database. At 10 million vectors, that is 10 million dot-product calculations per request. This simply does not scale.&lt;/p&gt;
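
&lt;p&gt;The cost is easy to see in a NumPy sketch: one dot product per stored vector, per query (sizes here are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Brute-force (Flat) k-NN: exact, but O(N) distance computations per
# query. Sizes are illustrative.
import numpy as np

N, DIM, K = 1_000_000, 768, 10
db = np.random.rand(N, DIM).astype(np.float32)
query = np.random.rand(DIM).astype(np.float32)

scores = db @ query              # one dot product per stored vector
top_k = np.argsort(-scores)[:K]  # plus a full sort over N scores
&lt;/code&gt;&lt;/pre&gt;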

&lt;p&gt;To solve this, we use &lt;strong&gt;Approximate Nearest Neighbor (ANN)&lt;/strong&gt; algorithms. The trade-off is simple: we sacrifice a tiny bit of accuracy (recall) for a massive boost in speed. However, implementing ANN at scale introduces a new challenge: index management. &lt;/p&gt;

&lt;p&gt;When you add new data, the index must be updated. If you rebuild the index from scratch every time, your system becomes effectively read-only during the update. If you update it incrementally, index quality degrades, and search accuracy plummets.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture: Decoupling Storage from Compute
&lt;/h3&gt;

&lt;p&gt;To solve the "update vs. search" paradox, a world-class vector database separates the storage layer from the indexing layer. A vector index cannot be treated like a standard row in Postgres; it is a massive, interconnected graph or a set of clustered centroids that must reside in memory for performance but persist on disk for durability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hnn9gon6qjue0mbc7ge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hnn9gon6qjue0mbc7ge.png" alt="Detailed description" width="688" height="586"&gt;&lt;/a&gt;&lt;br&gt;Figure 1: High-level overview of the indexing workflow.
  &lt;/p&gt;

&lt;p&gt;In this architecture, the &lt;strong&gt;Query Service&lt;/strong&gt; is optimized for read-heavy workloads, pulling the index into RAM to perform ANN searches. The &lt;strong&gt;Index Service&lt;/strong&gt; handles the heavy lifting of partitioning data and building index structures. By utilizing a Write-Ahead Log (WAL) and an object store (such as S3), we ensure that if a node crashes, the index can be reconstructed without losing a single embedding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Components: The Engine Room
&lt;/h3&gt;

&lt;p&gt;To achieve this level of performance, three specific modules must work in harmony: the Indexer, the Segment Manager, and the Metadata Filter.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. The Indexer (HNSW vs. IVF)
&lt;/h4&gt;

&lt;p&gt;Most production systems rely on &lt;strong&gt;HNSW (Hierarchical Navigable Small World)&lt;/strong&gt;. Think of HNSW as a "skip-list" for vectors. It creates a multi-layered graph where the top layers act as "express lanes," allowing the search to jump across the vector space quickly. As the search moves down the layers, the graph becomes denser, allowing the system to home in on the exact nearest neighbor.&lt;/p&gt;
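
&lt;p&gt;Here is a minimal sketch using the &lt;code&gt;hnswlib&lt;/code&gt; library; the parameter values are illustrative starting points, not tuned recommendations:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# HNSW in practice via hnswlib (pip install hnswlib).
import hnswlib
import numpy as np

DIM = 768
data = np.random.rand(10_000, DIM).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=DIM)
index.init_index(max_elements=10_000, ef_construction=200, M=16)
index.add_items(data, ids=np.arange(10_000))

index.set_ef(50)  # higher ef: better recall, slower queries
labels, distances = index.knn_query(data[:1], k=10)
&lt;/code&gt;&lt;/pre&gt;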

&lt;h4&gt;
  
  
  2. The Segment Manager
&lt;/h4&gt;

&lt;p&gt;Maintaining one giant index is risky and slow to update. Instead, data is broken into &lt;strong&gt;segments&lt;/strong&gt;—each acting as its own mini-index. When a segment becomes too large, it is merged with others (similar to how an LSM-tree works in RocksDB). This prevents index degradation and enables parallel searching across multiple segments.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. The Metadata Filter
&lt;/h4&gt;

&lt;p&gt;Vector search is rarely about vectors alone. Usually, you need "the most similar document &lt;em&gt;where&lt;/em&gt; &lt;code&gt;user_id = 123&lt;/code&gt; and &lt;code&gt;date &amp;gt; 2023&lt;/code&gt;." &lt;/p&gt;

&lt;p&gt;Performing this as a &lt;strong&gt;post-filter&lt;/strong&gt; (searching vectors first, then filtering) is inefficient; the top 100 vectors might all be filtered out, leaving you with zero results. The gold standard is &lt;strong&gt;pre-filtering&lt;/strong&gt;, where metadata constraints are applied during the graph traversal itself.&lt;/p&gt;
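
&lt;p&gt;A toy illustration of the difference, assuming &lt;code&gt;ranked_ids&lt;/code&gt; is already sorted by similarity:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Post-filter vs. pre-filter (toy sketch). ranked_ids is sorted by
# similarity; metadata maps id to attributes.
import itertools


def post_filter(ranked_ids, metadata, user_id, k):
    hits = [i for i in ranked_ids[:k] if metadata[i]["user_id"] == user_id]
    return hits  # may be empty even though matches exist further down


def pre_filter(ranked_ids, metadata, user_id, k):
    allowed = (i for i in ranked_ids if metadata[i]["user_id"] == user_id)
    return list(itertools.islice(allowed, k))  # constraint applied first
&lt;/code&gt;&lt;/pre&gt;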

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33llbe06gf26r1azung5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33llbe06gf26r1azung5.png" alt="Abstract illustration of high-dimensional vector space scaling" width="694" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Workflow: From Embedding to Result
&lt;/h3&gt;

&lt;p&gt;Data movement in a vector database is not a straight line; it is a cycle of ingestion and optimization.&lt;/p&gt;

&lt;p&gt;First, raw text is processed by an embedding model (such as &lt;code&gt;text-embedding-3-small&lt;/code&gt;) to create a vector. This vector is sent to the Index Service and written to the WAL for safety.&lt;/p&gt;

&lt;p&gt;To avoid the latency of updating the HNSW graph immediately, the vector is first placed in a &lt;strong&gt;buffer&lt;/strong&gt; (a small, flat index). Once the buffer reaches a specific threshold, the system triggers a background job to build a new HNSW segment, which is then pushed to the Query Service nodes.&lt;/p&gt;

&lt;p&gt;When a query arrives, the system searches both the optimized HNSW segments and the small flat buffer. This ensures that data is searchable almost instantly (low ingestion latency) while maintaining the speed of graph-based search (low query latency).&lt;/p&gt;
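
&lt;p&gt;A sketch of that hybrid read path, assuming hypothetical &lt;code&gt;search_segment&lt;/code&gt; and &lt;code&gt;search_buffer&lt;/code&gt; helpers that each return &lt;code&gt;(distance, doc_id)&lt;/code&gt; pairs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hybrid read path (sketch): query sealed HNSW segments and the small
# flat buffer, then merge by distance. Helper functions are hypothetical.
import heapq


def hybrid_search(query, segments, buffer, search_segment, search_buffer,
                  k=10):
    candidates = []
    for seg in segments:
        candidates.extend(search_segment(seg, query, k))  # ANN per segment
    candidates.extend(search_buffer(buffer, query, k))    # exact scan
    return heapq.nsmallest(k, candidates)  # smallest distance wins
&lt;/code&gt;&lt;/pre&gt;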

&lt;h3&gt;
  
  
  Trade-offs &amp;amp; Scalability
&lt;/h3&gt;

&lt;p&gt;Scaling a vector database is a balancing act between three variables: &lt;strong&gt;Latency, Recall, and Memory.&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  The Memory Wall
&lt;/h4&gt;

&lt;p&gt;HNSW indices are memory-intensive. If you have 1 billion 1536-dimensional vectors, you will need terabytes of RAM. To mitigate this, we use &lt;strong&gt;Product Quantization (PQ)&lt;/strong&gt;. PQ compresses vectors by splitting them into sub-vectors and clustering them, essentially storing a "codebook" and a short code for each vector. This can reduce memory usage by up to 90%, though it does result in a drop in recall (accuracy).&lt;/p&gt;
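
&lt;p&gt;The arithmetic is worth doing explicitly; the PQ configuration below (96 sub-vectors, one byte each) is one illustrative setting:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Back-of-the-envelope memory math: raw float32 vectors vs. PQ codes.
N, DIM = 1_000_000_000, 1536

raw_bytes = N * DIM * 4  # float32 = 4 bytes per dimension
pq_bytes = N * 96        # 96 sub-vectors, 1-byte code each (256 centroids)

print(raw_bytes / 1e12)  # ~6.1 TB just for the raw vectors
print(pq_bytes / 1e9)    # ~96 GB for the compressed codes
&lt;/code&gt;&lt;/pre&gt;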

&lt;h4&gt;
  
  
  Latency vs. Throughput
&lt;/h4&gt;

&lt;p&gt;To increase throughput, you must shard your data. This can be done by &lt;code&gt;tenant_id&lt;/code&gt; (for multi-tenant apps) or via random sharding. In a random sharding setup, a query is sent to every shard, and the Query Service aggregates the top results—a pattern known as &lt;strong&gt;"scatter-gather."&lt;/strong&gt;&lt;/p&gt;
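
&lt;p&gt;A sketch of scatter-gather, assuming a hypothetical &lt;code&gt;shard_search&lt;/code&gt; that returns per-shard &lt;code&gt;(distance, doc_id)&lt;/code&gt; pairs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Scatter-gather (sketch): fan the query out to every shard, then merge
# the per-shard top-k into a global top-k.
import heapq
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(query, shards, shard_search, k=10):
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: shard_search(s, query, k), shards)
    merged = [hit for part in partials for hit in part]
    return heapq.nsmallest(k, merged)  # global top-k by distance
&lt;/code&gt;&lt;/pre&gt;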

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7qdgcxj9s20xrztnn7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7qdgcxj9s20xrztnn7n.png" alt="Diagram of the scatter-gather search pattern showing query distribution and aggregation" width="696" height="335"&gt;&lt;/a&gt;&lt;br&gt;The "Scatter-Gather" pattern for distributed vector search.
  &lt;/p&gt;

&lt;p&gt;If you require higher recall, you can increase the &lt;code&gt;efConstruction&lt;/code&gt; and &lt;code&gt;efSearch&lt;/code&gt; parameters in HNSW. This makes the search more thorough but slower. It is a sliding scale: do you want the absolute best answer in 200ms, or a "good enough" answer in 20ms?&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Avoid Flat indices in production:&lt;/strong&gt; Use HNSW for the best balance of speed and recall, but plan for the memory overhead.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Decouple Storage and Compute:&lt;/strong&gt; Use a WAL and object store to ensure indices are durable and can be rebuilt without downtime.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Prioritize Pre-filtering:&lt;/strong&gt; Implement pre-filtering via a metadata store to avoid the "empty result set" problem.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Compress to Scale:&lt;/strong&gt; Use Product Quantization (PQ) when your dataset exceeds your RAM budget, but carefully measure the impact on recall.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>systemdesign</category>
      <category>vectordatabases</category>
      <category>machinelearning</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Agentic ML: Moving from Manual Pipelines to Autonomous AI</title>
      <dc:creator>Karan Kumar</dc:creator>
      <pubDate>Mon, 13 Apr 2026 10:24:52 +0000</pubDate>
      <link>https://forem.com/karan_kumar_f09865ff0efe9/agentic-ml-moving-from-manual-pipelines-to-autonomous-ai-e32</link>
      <guid>https://forem.com/karan_kumar_f09865ff0efe9/agentic-ml-moving-from-manual-pipelines-to-autonomous-ai-e32</guid>
      <description>&lt;p&gt;Your data scientists spend 80% of their time writing boilerplate for feature engineering, debugging CUDA drivers, and stitching together disparate APIs. The actual "science"—the modeling and insight—is a tiny fraction of the workday. This is the "ML Tax," and it is the primary reason most production models never leave the notebook.&lt;/p&gt;

&lt;p&gt;For the last decade, we have built MLOps to manage this complexity. However, we haven't solved the problem; we have simply given it a name and a set of tools. The real shift isn't better orchestration—it is moving from &lt;em&gt;manual pipelines&lt;/em&gt; to &lt;em&gt;agentic workflows&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge: The "Context Switch" Death Spiral
&lt;/h3&gt;

&lt;p&gt;At scale, the ML lifecycle is a fragmented nightmare. Data lives in a warehouse, training scripts reside in a notebook, orchestration is handled by a DAG (like Airflow), and inference runs on a separate Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;Every time a data scientist wants to test a new hypothesis, they hit a wall of friction:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Gravity:&lt;/strong&gt; Moving terabytes of data from the warehouse to the training environment is slow, cumbersome, and risky.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Friction:&lt;/strong&gt; Tuning hyperparameters or configuring distributed training requires deep DevOps knowledge, not just ML expertise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Feedback Loop:&lt;/strong&gt; Identifying why a model is underperforming usually involves manually grepping logs and visualizing feature importance in a separate, disconnected tool.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When your environment is fragmented, the cost of experimentation skyrockets. You stop taking risks. You stop iterating. Your models stagnate.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture: The Agentic AI Data Cloud
&lt;/h3&gt;

&lt;p&gt;To eliminate the ML Tax, we must collapse the distance between the data and the compute. The modern solution is an &lt;strong&gt;Agentic ML Layer&lt;/strong&gt; that sits directly on top of a governed data cloud.&lt;/p&gt;

&lt;p&gt;Instead of you writing the code to move data, an agent—possessing full awareness of your schema, permissions, and compute resources—writes and executes the pipeline for you. It doesn't just suggest code; it reasons through the entire ML lifecycle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBUQgogICAgVXNlcltVc2VyOiBOYXR1cmFsIExhbmd1YWdlIFByb21wdF0gLS0-IEFnZW50W0NvcnRleCBDb2RlIEFnZW50XQogICAgc3ViZ3JhcGggRGF0YUNsb3VkIFtBSSBEYXRhIENsb3VkXQogICAgICAgIEFnZW50IC0tPiBQbGFubmVyW1JlYXNvbmluZyAmIFBsYW5uaW5nIEVuZ2luZV0KICAgICAgICBQbGFubmVyIC0tPiBTa2lsbHNbTUwgU2tpbGwgTGlicmFyeTogRmVhdHVyZSBFbmcsIFR1bmluZywgVHJhaW5pbmddCiAgICAgICAgU2tpbGxzIC0tPiBDb21wdXRlW1VuaWZpZWQgQ29tcHV0ZTogQ1BVL0dQVSBDbHVzdGVyc10KICAgICAgICBDb21wdXRlIC0tPiBEYXRhW0dvdmVybmVkIERhdGEgTGFrZS9XYXJlaG91c2VdCiAgICAgICAgRGF0YSAtLT4gQ29tcHV0ZQogICAgZW5kCiAgICBDb21wdXRlIC0tPiBFdmFsW1BlcmZvcm1hbmNlIEV2YWx1YXRpb25dCiAgICBFdmFsIC0tPiBBZ2VudA%3D%3D%3FbgColor%3D%21white" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBUQgogICAgVXNlcltVc2VyOiBOYXR1cmFsIExhbmd1YWdlIFByb21wdF0gLS0-IEFnZW50W0NvcnRleCBDb2RlIEFnZW50XQogICAgc3ViZ3JhcGggRGF0YUNsb3VkIFtBSSBEYXRhIENsb3VkXQogICAgICAgIEFnZW50IC0tPiBQbGFubmVyW1JlYXNvbmluZyAmIFBsYW5uaW5nIEVuZ2luZV0KICAgICAgICBQbGFubmVyIC0tPiBTa2lsbHNbTUwgU2tpbGwgTGlicmFyeTogRmVhdHVyZSBFbmcsIFR1bmluZywgVHJhaW5pbmddCiAgICAgICAgU2tpbGxzIC0tPiBDb21wdXRlW1VuaWZpZWQgQ29tcHV0ZTogQ1BVL0dQVSBDbHVzdGVyc10KICAgICAgICBDb21wdXRlIC0tPiBEYXRhW0dvdmVybmVkIERhdGEgTGFrZS9XYXJlaG91c2VdCiAgICAgICAgRGF0YSAtLT4gQ29tcHV0ZQogICAgZW5kCiAgICBDb21wdXRlIC0tPiBFdmFsW1BlcmZvcm1hbmNlIEV2YWx1YXRpb25dCiAgICBFdmFsIC0tPiBBZ2VudA%3D%3D%3FbgColor%3D%21white" alt="architecture diagram" width="669" height="736"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this architecture, the agent acts as the orchestrator. It doesn't just generate a Python snippet; it manages the state of the entire pipeline. If training fails due to an OOM (Out of Memory) error, the agent doesn't just report the failure—it analyzes the memory profile and automatically adjusts the distributed training configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Components: The "Brain" and the "Hands"
&lt;/h3&gt;

&lt;p&gt;An agentic ML system is split into two primary components: the &lt;strong&gt;Reasoning Engine&lt;/strong&gt; and the &lt;strong&gt;Skill Set&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Reasoning Engine (The Brain)&lt;/strong&gt;&lt;br&gt;
This is the LLM-driven core that translates a high-level request, such as &lt;em&gt;"I want to predict customer churn for Q3,"&lt;/em&gt; into a series of technical steps. It performs a dependency analysis: &lt;em&gt;Do I have the labels? Are there nulls in the features? Which model architecture best fits this data size?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Skill Library (The Hands)&lt;/strong&gt;&lt;br&gt;
An LLM alone is just a chatbot. To function as an agent, it needs specialized tools. These are pre-built, optimized modules for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automated Feature Engineering:&lt;/strong&gt; Identifying redundant features and suggesting new ones based on data distributions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyperparameter Optimization (HPO):&lt;/strong&gt; Running distributed sweeps across GPU clusters without requiring the user to manually configure the grid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Training:&lt;/strong&gt; Managing the complexity of sharding models across multiple nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBURAogICAgUHJvbXB0WyJQcmVkaWN0IENodXJuIl0gLS0-IFBsYW5bUGxhbjogRGF0YSBQcmVwIC0-IFRyYWluIC0-IEV2YWxdCiAgICBQbGFuIC0tPiBTdGVwMVtTa2lsbDogRmVhdHVyZSBFbmdpbmVlcmluZ10KICAgIFN0ZXAxIC0tPiBTdGVwMltTa2lsbDogTW9kZWwgU2VsZWN0aW9uXQogICAgU3RlcDIgLS0-IFN0ZXAzW1NraWxsOiBEaXN0cmlidXRlZCBUcmFpbmluZ10KICAgIFN0ZXAzIC0tPiBTdGVwNFtTa2lsbDogUGVyZm9ybWFuY2UgTW9uaXRvcmluZ10KICAgIFN0ZXA0IC0tPiBGZWVkYmFja3tNZWV0cyBLUEk_fQogICAgRmVlZGJhY2sgLS0gTm8gLS0-IFBsYW4KICAgIEZlZWRiYWNrIC0tIFllcyAtLT4gRGVwbG95W1Byb2R1Y3Rpb24gSW5mZXJlbmNlXQ%3D%3D%3FbgColor%3D%21white" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBURAogICAgUHJvbXB0WyJQcmVkaWN0IENodXJuIl0gLS0-IFBsYW5bUGxhbjogRGF0YSBQcmVwIC0-IFRyYWluIC0-IEV2YWxdCiAgICBQbGFuIC0tPiBTdGVwMVtTa2lsbDogRmVhdHVyZSBFbmdpbmVlcmluZ10KICAgIFN0ZXAxIC0tPiBTdGVwMltTa2lsbDogTW9kZWwgU2VsZWN0aW9uXQogICAgU3RlcDIgLS0-IFN0ZXAzW1NraWxsOiBEaXN0cmlidXRlZCBUcmFpbmluZ10KICAgIFN0ZXAzIC0tPiBTdGVwNFtTa2lsbDogUGVyZm9ybWFuY2UgTW9uaXRvcmluZ10KICAgIFN0ZXA0IC0tPiBGZWVkYmFja3tNZWV0cyBLUEk_fQogICAgRmVlZGJhY2sgLS0gTm8gLS0-IFBsYW4KICAgIEZlZWRiYWNrIC0tIFllcyAtLT4gRGVwbG95W1Byb2R1Y3Rpb24gSW5mZXJlbmNlXQ%3D%3D%3FbgColor%3D%21white" alt="architecture diagram" width="356" height="946"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data &amp;amp; Workflow: Closing the Loop
&lt;/h3&gt;

&lt;p&gt;In a traditional workflow, data flows from: &lt;strong&gt;Warehouse → CSV/Parquet → Training Script → Model Registry.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In an agentic workflow, the data never leaves the governed perimeter. The agent triggers compute &lt;em&gt;inside&lt;/em&gt; the data cloud.&lt;/p&gt;

&lt;p&gt;Consider a fraud detection use case. The agent doesn't just write a &lt;code&gt;SELECT&lt;/code&gt; statement. It analyzes transaction patterns, identifies that the model is failing on high-frequency, small-value transactions, and autonomously proposes a new feature—perhaps a rolling 10-minute window count—to capture that signal. It then implements the feature, retrains the model, and presents the resulting lift in precision and recall to the engineer.&lt;/p&gt;
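
&lt;p&gt;In pandas, the proposed feature might look like the sketch below; the column names and window size are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Rolling 10-minute transaction count per card (sketch).
import pandas as pd

df = pd.DataFrame({
    "card_id": ["a", "a", "b", "a"],
    "ts": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:04",
                          "2024-01-01 10:05", "2024-01-01 10:09"]),
    "amount": [9.99, 4.50, 120.0, 7.25],
}).sort_values(["card_id", "ts"])

df["txn_count_10min"] = (
    df.groupby("card_id")
      .rolling("10min", on="ts")["amount"]
      .count()
      .values  # row order matches df because of the sort above
)
&lt;/code&gt;&lt;/pre&gt;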

&lt;h3&gt;
  
  
  Trade-offs &amp;amp; Scalability
&lt;/h3&gt;

&lt;p&gt;Moving to an agentic system is not a magic bullet; there are real engineering trade-offs to consider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency vs. Throughput&lt;/strong&gt;&lt;br&gt;
Agentic loops introduce "reasoning overhead." An LLM taking five seconds to decide which skill to call is negligible for a training pipeline that takes four hours, but it is unacceptable for real-time inference. This is why the &lt;strong&gt;Agent&lt;/strong&gt; is used for &lt;em&gt;development&lt;/em&gt; (the control plane), while the &lt;strong&gt;Compiled Model&lt;/strong&gt; is used for &lt;em&gt;production&lt;/em&gt; (the data plane).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Black Box" Problem&lt;/strong&gt;&lt;br&gt;
When an agent automates feature engineering, visibility can decrease. To solve this, the system must provide a comprehensive audit trail—essentially a "Chain of Thought" log—showing exactly why a specific feature was dropped or why a specific hyperparameter was chosen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute Efficiency&lt;/strong&gt;&lt;br&gt;
Running LLM-driven agents on top of GPU training is expensive. However, by optimizing the underlying libraries (e.g., using specialized XGBoost implementations), you can achieve inference speeds 10x faster than legacy cloud providers, effectively offsetting the cost of agentic orchestration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpzZXF1ZW5jZURpYWdyYW0KICAgIHBhcnRpY2lwYW50IEVuZyBhcyBFbmdpbmVlcgogICAgcGFydGljaXBhbnQgQWd0IGFzIENvcnRleCBBZ2VudAogICAgcGFydGljaXBhbnQgQ29tcCBhcyBHUFUgQ2x1c3RlcgogICAgcGFydGljaXBhbnQgRGF0YSBhcyBEYXRhIFdhcmVob3VzZQoKICAgIEVuZy0-PkFndDogIk9wdGltaXplIHRoaXMgY2h1cm4gbW9kZWwiCiAgICBBZ3QtPj5EYXRhOiBBbmFseXplIEZlYXR1cmUgSW1wb3J0YW5jZQogICAgRGF0YS0tPj5BZ3Q6IEZlYXR1cmUgWCBpcyByZWR1bmRhbnQKICAgIEFndC0-PkNvbXA6IFRyaWdnZXIgRGlzdHJpYnV0ZWQgUmV0cmFpbiAod2l0aG91dCBGZWF0dXJlIFgpCiAgICBDb21wLT4-RGF0YTogUHVsbCBPcHRpbWl6ZWQgRGF0YXNldAogICAgQ29tcC0tPj5BZ3Q6IE5ldyBBY2N1cmFjeTogKzIuNCUKICAgIEFndC0-PkVuZzogIlJlbW92ZWQgRmVhdHVyZSBYLCBBY2N1cmFjeSBpbXByb3ZlZCBieSAyLjQlIg%3D%3D%3FbgColor%3D%21white" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpzZXF1ZW5jZURpYWdyYW0KICAgIHBhcnRpY2lwYW50IEVuZyBhcyBFbmdpbmVlcgogICAgcGFydGljaXBhbnQgQWd0IGFzIENvcnRleCBBZ2VudAogICAgcGFydGljaXBhbnQgQ29tcCBhcyBHUFUgQ2x1c3RlcgogICAgcGFydGljaXBhbnQgRGF0YSBhcyBEYXRhIFdhcmVob3VzZQoKICAgIEVuZy0-PkFndDogIk9wdGltaXplIHRoaXMgY2h1cm4gbW9kZWwiCiAgICBBZ3QtPj5EYXRhOiBBbmFseXplIEZlYXR1cmUgSW1wb3J0YW5jZQogICAgRGF0YS0tPj5BZ3Q6IEZlYXR1cmUgWCBpcyByZWR1bmRhbnQKICAgIEFndC0-PkNvbXA6IFRyaWdnZXIgRGlzdHJpYnV0ZWQgUmV0cmFpbiAod2l0aG91dCBGZWF0dXJlIFgpCiAgICBDb21wLT4-RGF0YTogUHVsbCBPcHRpbWl6ZWQgRGF0YXNldAogICAgQ29tcC0tPj5BZ3Q6IE5ldyBBY2N1cmFjeTogKzIuNCUKICAgIEFndC0-PkVuZzogIlJlbW92ZWQgRmVhdHVyZSBYLCBBY2N1cmFjeSBpbXByb3ZlZCBieSAyLjQlIg%3D%3D%3FbgColor%3D%21white" alt="sequence diagram" width="1249" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Collapse the Stack:&lt;/strong&gt; Stop moving data to your tools. Move your tools (and your agents) to your data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents &amp;gt; Pipelines:&lt;/strong&gt; Static DAGs are brittle. Agentic workflows that can reason, fail, and retry represent the future of MLOps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on the 'What', not the 'How':&lt;/strong&gt; The goal is to transition the data scientist from a "coder" to a "reviewer," allowing them to focus on domain expertise rather than infrastructure debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Execution:&lt;/strong&gt; Use agents for the complex, iterative development phase, but deploy lean, optimized artifacts for the production inference phase.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Designing GenAI Infrastructure: How to Scale Video Generation</title>
      <dc:creator>Karan Kumar</dc:creator>
      <pubDate>Sun, 12 Apr 2026 19:56:56 +0000</pubDate>
      <link>https://forem.com/karan_kumar_f09865ff0efe9/designing-genai-infrastructure-how-to-scale-video-generation-21bh</link>
      <guid>https://forem.com/karan_kumar_f09865ff0efe9/designing-genai-infrastructure-how-to-scale-video-generation-21bh</guid>
      <description>&lt;p&gt;Your GPU cluster is at 98% utilization. Latency for a five-second video clip has spiked to 40 seconds. Users are reporting timeouts, and your cost-per-inference is eroding your entire margin. &lt;/p&gt;

&lt;p&gt;This is a common breaking point for many AI startups. Standard request-response architectures are fundamentally ill-equipped for the demands of Generative AI. Here is why they fail and how to build a system that actually scales.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge: The GPU Bottleneck
&lt;/h3&gt;

&lt;p&gt;Generating a video is not like serving a traditional REST API. In a typical web application, a request takes milliseconds and consumes negligible CPU. In Generative AI—specifically diffusion models for video—a single request triggers a massive, compute-intensive workload that can last seconds or even minutes.&lt;/p&gt;

&lt;p&gt;If you rely on a synchronous architecture, your API gateway will time out long before the GPU finishes the sampling process. Conversely, simply spinning up more GPUs is a recipe for bankruptcy; GPUs are prohibitively expensive and often sit idle during the "pre-processing" and "post-processing" phases of a pipeline.&lt;/p&gt;

&lt;p&gt;The real difficulty isn't just the raw compute; it's the orchestration. You must manage massive model weights (often gigabytes in size), handle complex asynchronous state transitions, and ensure that a single "heavy" user doesn't starve others of resources. You aren't just building a website; you're building a distributed task scheduler that happens to have a neural network at the end of it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture: Asynchronous Orchestration
&lt;/h3&gt;

&lt;p&gt;To solve this, we must move away from synchronous calls. Instead, we treat every generation request as a "Job." The API does not return a video immediately; it returns a &lt;code&gt;job_id&lt;/code&gt; and a promise that the video will be ready eventually.&lt;/p&gt;

&lt;p&gt;By decoupling the &lt;strong&gt;Request Layer&lt;/strong&gt; (user interaction) from the &lt;strong&gt;Execution Layer&lt;/strong&gt; (GPU compute) using a high-throughput message broker, we can buffer traffic spikes and process jobs based on priority and available hardware capacity.&lt;/p&gt;
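
&lt;p&gt;A minimal sketch of that contract, assuming FastAPI and Redis purely for illustration; any web framework and durable broker fit the same shape. The worker side, which pops jobs and updates the status hash, is omitted:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal job-submission API: return a job_id in milliseconds, never block
# on the GPU. FastAPI and Redis are illustrative choices, not requirements.
import json
import uuid

import redis
from fastapi import FastAPI

app = FastAPI()
r = redis.Redis()

@app.post("/generate", status_code=202)
def submit_job(prompt: str, priority: str = "standard"):
    job_id = str(uuid.uuid4())
    r.hset(f"job:{job_id}", mapping={"status": "queued"})
    r.lpush(f"jobs:{priority}", json.dumps({"id": job_id, "prompt": prompt}))
    return {"job_id": job_id}

@app.get("/status/{job_id}")
def job_status(job_id: str):
    # Workers update this hash as they pull, run, and complete the job.
    return r.hgetall(f"job:{job_id}") or {"status": "unknown"}
&lt;/code&gt;&lt;/pre&gt;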

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBUQgogICAgVXNlcigoVXNlcikpIC0tPiBBUElbQVBJIEdhdGV3YXldCiAgICBBUEkgLS0-IEF1dGhbQXV0aCAmIFJhdGUgTGltaXRlcl0KICAgIEF1dGggLS0-IEpvYlF1ZXVlW0Rpc3RyaWJ1dGVkIEpvYiBRdWV1ZSAvIFJlZGlzXQogICAgSm9iUXVldWUgLS0-IE9yY2hlc3RyYXRvcltKb2IgT3JjaGVzdHJhdG9yXQogICAgT3JjaGVzdHJhdG9yIC0tPiBXb3JrZXJQb29sW0dQVSBXb3JrZXIgUG9vbF0KICAgIFdvcmtlclBvb2wgLS0-IE1vZGVsU3RvcmVbTW9kZWwgV2VpZ2h0cyBTdG9yZSAvIFMzXQogICAgV29ya2VyUG9vbCAtLT4gQ2FjaGVbS1YgQ2FjaGUgLyBSZWRpc10KICAgIFdvcmtlclBvb2wgLS0-IFN0b3JhZ2VbQmxvYiBTdG9yYWdlIC8gUzNdCiAgICBTdG9yYWdlIC0tPiBDRE5bQ0ROIC8gRGVsaXZlcnldCiAgICBDRE4gLS0-IFVzZXI%3D%3FbgColor%3D%21white" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBUQgogICAgVXNlcigoVXNlcikpIC0tPiBBUElbQVBJIEdhdGV3YXldCiAgICBBUEkgLS0-IEF1dGhbQXV0aCAmIFJhdGUgTGltaXRlcl0KICAgIEF1dGggLS0-IEpvYlF1ZXVlW0Rpc3RyaWJ1dGVkIEpvYiBRdWV1ZSAvIFJlZGlzXQogICAgSm9iUXVldWUgLS0-IE9yY2hlc3RyYXRvcltKb2IgT3JjaGVzdHJhdG9yXQogICAgT3JjaGVzdHJhdG9yIC0tPiBXb3JrZXJQb29sW0dQVSBXb3JrZXIgUG9vbF0KICAgIFdvcmtlclBvb2wgLS0-IE1vZGVsU3RvcmVbTW9kZWwgV2VpZ2h0cyBTdG9yZSAvIFMzXQogICAgV29ya2VyUG9vbCAtLT4gQ2FjaGVbS1YgQ2FjaGUgLyBSZWRpc10KICAgIFdvcmtlclBvb2wgLS0-IFN0b3JhZ2VbQmxvYiBTdG9yYWdlIC8gUzNdCiAgICBTdG9yYWdlIC0tPiBDRE5bQ0ROIC8gRGVsaXZlcnldCiAgICBDRE4gLS0-IFVzZXI%3D%3FbgColor%3D%21white" alt="architecture diagram" width="745" height="789"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Components: The Engine Room
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. The Job Orchestrator
&lt;/h4&gt;

&lt;p&gt;The orchestrator is the brain of the system. It doesn't perform the mathematical computations; it manages the state. It determines which worker receives which job. For example, if a user is on a "Pro" plan, the orchestrator routes their job to a high-priority queue. If a worker crashes—a frequent occurrence due to CUDA Out-of-Memory (OOM) errors—the orchestrator detects the heartbeat failure and automatically requeues the job.&lt;/p&gt;
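
&lt;p&gt;The requeue logic is small enough to sketch. Everything here is illustrative: &lt;code&gt;workers&lt;/code&gt; is an in-memory registry of heartbeat reports, and &lt;code&gt;queue.requeue&lt;/code&gt; is a hypothetical broker call:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Heartbeat reaper: requeue jobs whose workers went silent (CUDA OOMs and
# crashes included). Timings, registry shape, and queue API are illustrative.
import time

HEARTBEAT_TIMEOUT = 30  # seconds without a heartbeat before we assume death

def reap_dead_workers(workers, queue):
    now = time.time()
    for worker in workers.values():
        silent_for = now - worker["last_heartbeat"]
        if worker["busy"] and silent_for &amp;gt; HEARTBEAT_TIMEOUT:
            queue.requeue(worker["job_id"])  # job goes back, nothing is lost
            worker["busy"] = False
&lt;/code&gt;&lt;/pre&gt;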

&lt;h4&gt;
  
  
  2. The GPU Worker Pool
&lt;/h4&gt;

&lt;p&gt;Workers are highly specialized. To avoid the inefficiency of loading a 20GB model from S3 for every request, workers keep models "warm" in VRAM. We employ a sidecar pattern to monitor GPU health and memory pressure, ensuring new jobs aren't pushed to a worker already at 95% VRAM utilization.&lt;/p&gt;
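
&lt;p&gt;Sketched with PyTorch (an assumption; any framework that exposes VRAM stats works), the two worker-side rules look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Worker-side rules, sketched with PyTorch: keep weights warm in VRAM, and
# refuse new work above the watermark the sidecar enforces.
import torch

_warm = {}  # weights kept resident in VRAM, keyed by model name

def get_model(name, loader):
    # Load once and reuse; re-reading 20GB of weights per request is the killer.
    if name not in _warm:
        _warm[name] = loader(name).to("cuda")  # loader() is hypothetical
    return _warm[name]

def has_vram_headroom(device=0, watermark=0.95):
    # Mirrors the 95% utilization check described above.
    free, total = torch.cuda.mem_get_info(device)
    return watermark &amp;gt; 1.0 - free / total
&lt;/code&gt;&lt;/pre&gt;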

&lt;h4&gt;
  
  
  3. The Model Store
&lt;/h4&gt;

&lt;p&gt;Loading models is the primary bottleneck during cold starts. We use a tiered approach: a global S3 bucket serves as the source of truth, while a local NVMe cache on the GPU nodes handles rapid access. This significantly reduces the "time to first token/frame."&lt;/p&gt;
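
&lt;p&gt;A minimal version of the tiered loader, assuming boto3 and a hypothetical NVMe mount at &lt;code&gt;/nvme/model-cache&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Tiered weight loading: local NVMe first, S3 as source of truth.
# Bucket name and cache directory are assumptions.
import os

import boto3

CACHE_DIR = "/nvme/model-cache"
s3 = boto3.client("s3")

def fetch_weights(key, bucket="model-weights"):
    local_path = os.path.join(CACHE_DIR, key)
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, key, local_path)  # cold path: one-time cost
    return local_path  # warm path: NVMe-speed reads thereafter
&lt;/code&gt;&lt;/pre&gt;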

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpzZXF1ZW5jZURpYWdyYW0KICAgIHBhcnRpY2lwYW50IFUgYXMgVXNlcgogICAgcGFydGljaXBhbnQgQSBhcyBBUEkKICAgIHBhcnRpY2lwYW50IFEgYXMgUXVldWUKICAgIHBhcnRpY2lwYW50IFcgYXMgR1BVIFdvcmtlcgogICAgcGFydGljaXBhbnQgUyBhcyBTMyBTdG9yYWdlCgogICAgVS0-PkE6IFBPU1QgL2dlbmVyYXRlIChQcm9tcHQpCiAgICBBLT4-UTogUHVzaCBKb2Ige2lkOiAxMjMsIHByaW9yaXR5OiBoaWdofQogICAgQS0tPj5VOiAyMDIgQWNjZXB0ZWQgKGpvYl9pZDogMTIzKQogICAgUS0-Plc6IFB1bGwgSm9iIDEyMwogICAgVy0-Plc6IFJ1biBEaWZmdXNpb24gUHJvY2VzcwogICAgVy0-PlM6IFVwbG9hZCAubXA0IFJlc3VsdAogICAgVy0-PlE6IE1hcmsgSm9iIDEyMyBDb21wbGV0ZQogICAgVS0-PkE6IEdFVCAvc3RhdHVzLzEyMwogICAgQS0-PlE6IENoZWNrIFN0YXR1cwogICAgUS0tPj5BOiBDb21wbGV0ZWQgLyBVUkwKICAgIEEtLT4-VTogMjAwIE9LICh2aWRlb191cmwp%3FbgColor%3D%21white" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpzZXF1ZW5jZURpYWdyYW0KICAgIHBhcnRpY2lwYW50IFUgYXMgVXNlcgogICAgcGFydGljaXBhbnQgQSBhcyBBUEkKICAgIHBhcnRpY2lwYW50IFEgYXMgUXVldWUKICAgIHBhcnRpY2lwYW50IFcgYXMgR1BVIFdvcmtlcgogICAgcGFydGljaXBhbnQgUyBhcyBTMyBTdG9yYWdlCgogICAgVS0-PkE6IFBPU1QgL2dlbmVyYXRlIChQcm9tcHQpCiAgICBBLT4-UTogUHVzaCBKb2Ige2lkOiAxMjMsIHByaW9yaXR5OiBoaWdofQogICAgQS0tPj5VOiAyMDIgQWNjZXB0ZWQgKGpvYl9pZDogMTIzKQogICAgUS0-Plc6IFB1bGwgSm9iIDEyMwogICAgVy0-Plc6IFJ1biBEaWZmdXNpb24gUHJvY2VzcwogICAgVy0-PlM6IFVwbG9hZCAubXA0IFJlc3VsdAogICAgVy0-PlE6IE1hcmsgSm9iIDEyMyBDb21wbGV0ZQogICAgVS0-PkE6IEdFVCAvc3RhdHVzLzEyMwogICAgQS0-PlE6IENoZWNrIFN0YXR1cwogICAgUS0tPj5BOiBDb21wbGV0ZWQgLyBVUkwKICAgIEEtLT4-VTogMjAwIE9LICh2aWRlb191cmwp%3FbgColor%3D%21white" alt="sequence diagram" width="1205" height="699"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data &amp;amp; Workflow: The Lifecycle of a Frame
&lt;/h3&gt;

&lt;p&gt;Data doesn't simply flow from prompt to video; it passes through a rigorous pipeline of transformations.&lt;/p&gt;

&lt;p&gt;First, the &lt;strong&gt;Prompt Processor&lt;/strong&gt; cleans the input, applies safety filters to prevent NSFW content, and may expand a simple prompt into a detailed one using a smaller, faster LLM.&lt;/p&gt;

&lt;p&gt;Second is the &lt;strong&gt;Sampling Loop&lt;/strong&gt;. The GPU doesn't "create" a video in one pass; it iteratively removes noise from a latent representation. This is the most time-consuming phase. We utilize techniques like &lt;em&gt;FlashAttention&lt;/em&gt; to optimize the memory footprint of the attention layers.&lt;/p&gt;

&lt;p&gt;Finally, the &lt;strong&gt;VAE Decoder&lt;/strong&gt; takes over. The result of the diffusion process exists in "latent space" (a compressed format). A Variational Autoencoder (VAE) is required to decode these latents back into actual pixels. Because this is a separate compute step, it can often be offloaded to a cheaper GPU or even a high-end CPU if latency is not the primary concern.&lt;/p&gt;
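
&lt;p&gt;Put together, the lifecycle is three separable stages. Every object in this sketch is a hypothetical placeholder; the point is that each stage is an independent unit you can pin to different hardware:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# The lifecycle as three separable stages. Every object here is a
# hypothetical placeholder; the value is in the seams between stages.
def safety_filter(prompt):
    return prompt  # stand-in for a real NSFW classifier / blocklist

def generate_video(prompt, llm, diffusion_model, vae, encoder):
    enriched = llm.expand(safety_filter(prompt))  # 1. prompt processing
    latents = diffusion_model.sample(enriched)    # 2. iterative denoising
    frames = vae.decode(latents)                  # 3. latents to pixels
    return encoder.to_mp4(frames)                 # can run on cheaper hardware
&lt;/code&gt;&lt;/pre&gt;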

&lt;h3&gt;
  
  
  Trade-offs &amp;amp; Scalability
&lt;/h3&gt;

&lt;p&gt;Scaling a GenAI system requires making strategic choices about where to sacrifice performance for cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency vs. Throughput:&lt;/strong&gt; For the lowest possible latency, you would keep one model per GPU and process one request at a time—but this is an inefficient use of resources. To increase throughput, we use &lt;strong&gt;Continuous Batching&lt;/strong&gt;. Instead of waiting for one video to finish, we slot new requests into the GPU's processing loop as soon as a slot opens. This can increase throughput by 2x–4x, with only a slight increase in individual request latency.&lt;/p&gt;
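
&lt;p&gt;Schematically, continuous batching looks like the loop below. The &lt;code&gt;queue&lt;/code&gt; and &lt;code&gt;model&lt;/code&gt; objects are hypothetical; production servers implement the same idea inside the scheduler (vLLM popularized the pattern for LLM serving):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Continuous batching, schematically: requests join mid-flight as slots free
# up. queue and model are hypothetical; real servers do this in the scheduler.
def serving_loop(queue, model, max_batch=8):
    active = []  # [request, steps_remaining] pairs
    while True:
        while len(active) != max_batch and not queue.empty():
            active.append([queue.get(), model.total_steps])  # admit new work
        if active:
            model.step([req for req, _ in active])  # one step, whole batch
            for item in active:
                item[1] -= 1
        finished = [req for req, left in active if left == 0]
        active = [item for item in active if item[1] != 0]
        for req in finished:
            req.complete()  # upload result, mark job done
&lt;/code&gt;&lt;/pre&gt;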

&lt;p&gt;&lt;strong&gt;VRAM Management:&lt;/strong&gt; The most common failure point is the Out-of-Memory (OOM) error. We implement &lt;strong&gt;Model Sharding&lt;/strong&gt; (splitting the model across multiple GPUs) for massive models. For models that fit on a single GPU, we use &lt;strong&gt;Quantization&lt;/strong&gt; (converting 32-bit floats to 8-bit or 4-bit), which cuts weight memory to roughly a quarter or an eighth with minimal impact on visual quality.&lt;/p&gt;
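
&lt;p&gt;In PyTorch terms (one possible stack, not the only one), the two cheapest levers look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Two cheap memory levers on a PyTorch model (toy module for illustration).
import copy

import torch
from torch.ao.quantization import quantize_dynamic

fp32 = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())

# Casting fp32 weights to fp16 halves memory; near-lossless for inference.
half_model = copy.deepcopy(fp32).half()

# Dynamic int8 quantization of linear layers: roughly 4x smaller than fp32.
int8_model = quantize_dynamic(fp32, {torch.nn.Linear}, dtype=torch.qint8)
&lt;/code&gt;&lt;/pre&gt;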

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBURAogICAgQVtSZXF1ZXN0IEFycml2YWxdIC0tPiBCe1ByaW9yaXR5P30KICAgIEIgLS0gSGlnaCAtLT4gQ1tQcmlvcml0eSBRdWV1ZV0KICAgIEIgLS0gTG93IC0tPiBEW1N0YW5kYXJkIFF1ZXVlXQogICAgQyAtLT4gRVtXb3JrZXIgd2l0aCBXYXJtIE1vZGVsXQogICAgRCAtLT4gRQogICAgRSAtLT4gRntWUkFNIEF2YWlsYWJsZT99CiAgICBGIC0tIFllcyAtLT4gR1tQcm9jZXNzIEJhdGNoXQogICAgRiAtLSBObyAtLT4gSFtXYWl0L1NjYWxlIFVwXQogICAgRyAtLT4gSVtWQUUgRGVjb2RpbmddCiAgICBJIC0tPiBKW1MzIFVwbG9hZF0%3D%3FbgColor%3D%21white" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FJSV7aW5pdDogeyd0aGVtZSc6ICdiYXNlJywgJ3RoZW1lVmFyaWFibGVzJzogeyAncHJpbWFyeUNvbG9yJzogJyNmZjlmMWMnLCAnc2Vjb25kYXJ5Q29sb3InOiAnIzJlYzRiNicsICd0ZXJ0aWFyeUNvbG9yJzogJyNlNzFkMzYnLCAncHJpbWFyeUJvcmRlckNvbG9yJzogJyMwMTE2MjcnLCAnbGluZUNvbG9yJzogJyMwMTE2MjcnLCAnZm9udEZhbWlseSc6ICdJbnRlciwgc2Fucy1zZXJpZid9fX0lJQpncmFwaCBURAogICAgQVtSZXF1ZXN0IEFycml2YWxdIC0tPiBCe1ByaW9yaXR5P30KICAgIEIgLS0gSGlnaCAtLT4gQ1tQcmlvcml0eSBRdWV1ZV0KICAgIEIgLS0gTG93IC0tPiBEW1N0YW5kYXJkIFF1ZXVlXQogICAgQyAtLT4gRVtXb3JrZXIgd2l0aCBXYXJtIE1vZGVsXQogICAgRCAtLT4gRQogICAgRSAtLT4gRntWUkFNIEF2YWlsYWJsZT99CiAgICBGIC0tIFllcyAtLT4gR1tQcm9jZXNzIEJhdGNoXQogICAgRiAtLSBObyAtLT4gSFtXYWl0L1NjYWxlIFVwXQogICAgRyAtLT4gSVtWQUUgRGVjb2RpbmddCiAgICBJIC0tPiBKW1MzIFVwbG9hZF0%3D%3FbgColor%3D%21white" alt="architecture diagram" width="383" height="1021"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Scaling Wall:&lt;/strong&gt; Eventually, you will hit the "Cold Start" wall. When scaling from 10 to 100 GPUs, the time required to pull 20GB of weights from S3 can saturate your network. The solution is a peer-to-peer (P2P) distribution system among workers or a dedicated high-speed model cache layer using a tool like JuiceFS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Never use synchronous APIs for GenAI.&lt;/strong&gt; Always implement a Job-Queue-Worker pattern to avoid timeouts and manage GPU spikes.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Model warmth is critical.&lt;/strong&gt; The cost of loading weights from disk to VRAM is your biggest latency killer; cache models aggressively on local NVMe.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Batching is essential for survival.&lt;/strong&gt; Implement continuous batching and quantization to maximize GPU throughput and lower your cost-per-generation.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Decouple the VAE.&lt;/strong&gt; Separate latent diffusion (heavy compute) from pixel decoding (lighter compute) to optimize hardware allocation.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How to Build an Agentic ML Pipeline: From Natural Language to Production</title>
      <dc:creator>Karan Kumar</dc:creator>
      <pubDate>Sun, 12 Apr 2026 19:11:43 +0000</pubDate>
      <link>https://forem.com/karan_kumar_f09865ff0efe9/how-to-build-an-agentic-ml-pipeline-from-natural-language-to-production-5054</link>
      <guid>https://forem.com/karan_kumar_f09865ff0efe9/how-to-build-an-agentic-ml-pipeline-from-natural-language-to-production-5054</guid>
      <description>&lt;p&gt;By the end of this post, you'll be able to design an agentic ML system that automates the path from raw data to predictive insights. You will learn how to eliminate the "context-switching tax" in data science and architect a closed-loop system where AI agents handle the tedious plumbing of feature engineering and hyperparameter tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge: The "Plumbing" Problem in ML
&lt;/h3&gt;

&lt;p&gt;Most ML projects don't fail because the math is wrong; they fail because the plumbing is broken.&lt;/p&gt;

&lt;p&gt;If you've ever deployed a model to production, you know the drill: you spend 10% of your time on actual model architecture and 90% wrestling with data pipelines, debugging CUDA errors, stitching together fragmented APIs, and manually tracking hyperparameters in a spreadsheet. This is the "context-switching tax." You jump from a Jupyter notebook to a terminal, then to a cloud console, and finally to a documentation page—all to figure out why a distributed training job just crashed.&lt;/p&gt;

&lt;p&gt;At scale, this manual overhead becomes a critical bottleneck. When an organization like the First National Bank of Omaha needs to run anomaly detection on call center analytics, they cannot afford a three-week cycle just to test a new feature hypothesis. The friction between &lt;em&gt;idea&lt;/em&gt; and &lt;em&gt;execution&lt;/em&gt; is where most ML ROI goes to die.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture: Agentic ML
&lt;/h3&gt;

&lt;p&gt;Traditional ML pipelines are linear: &lt;code&gt;Data → Preprocessing → Training → Deployment&lt;/code&gt;. If a failure occurs at the end of the chain, the developer must manually loop back to the start.&lt;/p&gt;

&lt;p&gt;Agentic ML flips this paradigm. Instead of a static pipeline, we introduce an AI Coding Agent (such as Snowflake's Cortex Code) that sits &lt;em&gt;above&lt;/em&gt; the infrastructure. This agent doesn't just write code; it reasons about the data, selects the optimal tool for the job, and executes the workflow within the governed environment where the data resides.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6isw6zsoundp50q9vuzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6isw6zsoundp50q9vuzg.png" alt="Diagram of Agentic ML Architecture" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Components: The Brain and the Brawn
&lt;/h3&gt;

&lt;p&gt;To make this system viable, you must separate the &lt;strong&gt;Reasoning Layer&lt;/strong&gt; (the Brain) from the &lt;strong&gt;Execution Layer&lt;/strong&gt; (the Brawn).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Reasoning Layer (The Agent)&lt;/strong&gt;&lt;br&gt;
This is where the LLM resides. It takes a prompt—such as &lt;em&gt;"Build a churn model and tell me why users are leaving"&lt;/em&gt;—and decomposes it into a Directed Acyclic Graph (DAG) of tasks. Rather than guessing, the agent utilizes "skills"—pre-defined technical capabilities it can trigger.&lt;/p&gt;
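
&lt;p&gt;A skill registry can be as small as a decorator and a dict. This is an illustrative sketch, not Cortex's actual API; the skill names match the ones in the takeaways below:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A skill registry in miniature: the agent picks from declared capabilities
# instead of free-forming code. Illustrative only; not Cortex's actual API.
SKILLS = {}

def skill(name):
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register

@skill("feature_importance_analysis")
def feature_importance(table):
    ...  # query the warehouse, rank features by gain

@skill("hyperparameter_tune")
def tune(table, target):
    ...  # launch a governed tuning job

def dispatch(plan):
    # plan is the agent's task list: pairs of (skill_name, kwargs)
    return [SKILLS[name](**kwargs) for name, kwargs in plan]
&lt;/code&gt;&lt;/pre&gt;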

&lt;p&gt;&lt;strong&gt;2. The Execution Layer (The Infrastructure)&lt;/strong&gt;&lt;br&gt;
This is where the heavy lifting happens. To avoid the latency of moving petabytes of data to a separate ML server, execution occurs &lt;em&gt;in-situ&lt;/em&gt;. By utilizing GPU-accelerated clusters that scale elastically, the system ensures that when an agent triggers a distributed XGBoost training job, it brings the compute to the data, rather than moving the data to the compute.&lt;/p&gt;
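
&lt;p&gt;Here is what such an in-situ skill might look like. The XGBoost calls are real; &lt;code&gt;fetch_table&lt;/code&gt; is a hypothetical helper that the platform would resolve inside its container runtime:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A sketch of an in-situ training skill: only the model artifact leaves the
# platform. The xgboost calls are real; fetch_table() is hypothetical and
# would be resolved inside the warehouse's container runtime.
import xgboost as xgb

def train_churn_model(fetch_table):
    df = fetch_table("CUSTOMER_FEATURES")  # rows never cross the network
    X, y = df.drop(columns=["churned"]), df["churned"]
    model = xgb.XGBClassifier(n_estimators=200, tree_method="hist")
    model.fit(X, y)
    return model  # register the artifact; raw data stays put
&lt;/code&gt;&lt;/pre&gt;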

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrg5wpxfy0eykteawu4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrg5wpxfy0eykteawu4z.png" alt="Flowchart of the Reasoning vs Execution layer" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data &amp;amp; Workflow: Closing the Loop
&lt;/h3&gt;

&lt;p&gt;The true power of this architecture lies in the iterative loop. In a traditional setup, evaluating feature importance requires writing a script, running it, plotting a graph, and manually deciding on the next step.&lt;/p&gt;

&lt;p&gt;In an agentic workflow, the agent manages the Observation → Orientation → Decision → Action (OODA) loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Observation:&lt;/strong&gt; The agent analyzes the current model's residuals to identify where it is failing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orientation:&lt;/strong&gt; It compares these failures against the available data schema to identify missing signals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision:&lt;/strong&gt; It decides to create a new lagged feature (e.g., "average spend over the last 30 days").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; It writes the necessary SQL/Python code to generate that feature and triggers a re-train.&lt;/li&gt;
&lt;/ol&gt;
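
&lt;p&gt;As control flow, the loop is compact. All helpers here are hypothetical placeholders; the essential property is that the agent, not the human, drives the iteration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# The OODA loop as control flow. All helpers are hypothetical placeholders;
# the essential property is that the agent drives the iteration, not the human.
def improve(model, data, budget=5):
    for _ in range(budget):
        residuals = model.worst_errors(data)             # Observe
        gap = compare_to_schema(residuals, data.schema)  # Orient
        feature = propose_feature(gap)                   # Decide
        if feature is None:
            break  # no promising signal left; stop spending compute
        data = data.with_feature(feature)                # Act: generate SQL
        model = retrain(model, data)                     # Act: re-train
    return model
&lt;/code&gt;&lt;/pre&gt;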

&lt;p&gt;This transforms the data scientist from a "coder who cleans data" into an "architect who reviews strategies."&lt;/p&gt;

&lt;h3&gt;
  
  
  Trade-offs &amp;amp; Scalability
&lt;/h3&gt;

&lt;p&gt;Transitioning to an agentic system is not a "free lunch"; there are significant engineering trade-offs to consider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency vs. Throughput&lt;/strong&gt;&lt;br&gt;
Agentic reasoning introduces overhead. An LLM taking five seconds to "plan" a task is negligible for a training pipeline that takes two hours, but it is a non-starter for real-time inference. Consequently, the &lt;em&gt;Agent&lt;/em&gt; manages the &lt;em&gt;Pipeline&lt;/em&gt;, but the &lt;em&gt;Pipeline&lt;/em&gt; itself remains a high-performance compiled binary (like XGBoost) for actual predictions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Governance Paradox&lt;/strong&gt;&lt;br&gt;
Granting an agent the power to write and execute code on production data can be daunting. The solution is a "Governed Sandbox." The agent operates within the existing Role-Based Access Control (RBAC) of the data cloud. If a user lacks permission to view PII data, the agent cannot "hallucinate" a way to access it, as the execution layer enforces the same permissions as a standard SQL query.&lt;/p&gt;
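
&lt;p&gt;The governance point, sketched with hypothetical names: the executor runs with the caller's role, so denial happens at the data layer rather than in the prompt:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Governance in one function: the executor inherits the caller's role, so
# denial happens at the data layer, not in the prompt. Names are hypothetical.
def execute_as(user, sql, warehouse):
    session = warehouse.session(role=user.role)  # same RBAC as a manual query
    try:
        return session.run(sql)
    except PermissionError:
        # The agent sees the same refusal a hand-written query would get.
        return {"error": "access denied by RBAC policy"}
&lt;/code&gt;&lt;/pre&gt;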

&lt;p&gt;&lt;strong&gt;Compute Efficiency&lt;/strong&gt;&lt;br&gt;
Distributed training is expensive. A naive agent might trigger 100 training runs to find the optimal hyperparameters. To scale this, &lt;strong&gt;Early Stopping&lt;/strong&gt; and &lt;strong&gt;Bayesian Optimization&lt;/strong&gt; must be baked into the agent's skills, ensuring it converges on a solution with the minimum number of GPU hours.&lt;/p&gt;
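
&lt;p&gt;One concrete way to bake this in, assuming Optuna as the search library; the objective below is a toy stand-in for a real training run:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Search efficiency as a built-in skill, assuming Optuna: TPE sampling is
# Bayesian-flavored, and MedianPruner kills unpromising trials early.
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-4, 0.3, log=True)
    depth = trial.suggest_int("max_depth", 2, 12)
    # Toy stand-in for a real training run that would report intermediate
    # scores so the pruner can stop hopeless trials.
    return 1.0 - abs(lr - 0.05) - 0.01 * abs(depth - 6)

study = optuna.create_study(direction="maximize",
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=30)
print(study.best_params)
&lt;/code&gt;&lt;/pre&gt;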

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozgi27xidxpwzgvdc1oq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozgi27xidxpwzgvdc1oq.png" alt="Diagram summarizing ML Agent trade-offs: Latency, Governance, and Compute Efficiency" width="689" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Kill the Context Switch:&lt;/strong&gt; The goal of Agentic ML is to merge the development and data environments. Moving data to a separate VM for training is a productivity leak.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Skills over Scripts:&lt;/strong&gt; Avoid building a monolithic agent. Instead, develop a library of "ML Skills" (e.g., &lt;code&gt;feature_importance_analysis&lt;/code&gt;, &lt;code&gt;hyperparameter_tune&lt;/code&gt;) that the agent can call as tools.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Governance is Non-Negotiable:&lt;/strong&gt; Agentic systems must inherit the security model of the underlying data store. Never allow an agent to bypass RBAC.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Focus on the OODA Loop:&lt;/strong&gt; The primary value is not in code generation, but in the agent's ability to observe model failure and autonomously propose a fix.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>aiagents</category>
      <category>systemdesign</category>
      <category>distributedsystems</category>
    </item>
  </channel>
</rss>
