<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Pushkar Gulkari</title>
    <description>The latest articles on Forem by Pushkar Gulkari (@pushkar_gulkari).</description>
    <link>https://forem.com/pushkar_gulkari</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3884197%2Fb485847e-608a-4b2b-9c73-089b0bbcb4a5.png</url>
      <title>Forem: Pushkar Gulkari</title>
      <link>https://forem.com/pushkar_gulkari</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pushkar_gulkari"/>
    <language>en</language>
    <item>
      <title>LLM-Native APIs: How the Runtime behind REST Changed Fundamentally in 2026</title>
      <dc:creator>Pushkar Gulkari</dc:creator>
      <pubDate>Fri, 24 Apr 2026 08:13:16 +0000</pubDate>
      <link>https://forem.com/epam_india_python/llm-native-apis-how-the-runtime-behind-rest-changed-fundamentally-in-2026-508f</link>
      <guid>https://forem.com/epam_india_python/llm-native-apis-how-the-runtime-behind-rest-changed-fundamentally-in-2026-508f</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For over a decade, the runtime behind a REST endpoint rested on assumptions that were safe to make. A request maps to a single, predictable operation. The response shape is known before execution begins. Each request is self-contained — no memory of what came before (stateless). Business logic is deterministic: same input, same output, every time.&lt;/p&gt;

&lt;p&gt;These assumptions held because they matched the workload. CRUD operations, relational queries, rule-based decisions — all of these are stateless, deterministic, and fast. REST was designed around them and served them well. But non-determinism is not new to backend systems: recommender systems have been probabilistic for over 15 years, long before LLMs existed. None of this is novel territory.&lt;/p&gt;

&lt;p&gt;What &lt;em&gt;is&lt;/em&gt; new is the general-purpose reasoning black box sitting behind your endpoint — a system that interprets intent, invokes tools dynamically, and produces open-ended outputs. The challenge is variable latency, variable cost, unbounded tool use, and stateful multi-step execution — all behind an endpoint that looks exactly like a REST API to the client.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Traditional REST APIs before the LLM Era:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;REST endpoints&lt;/strong&gt; – Predefined endpoints responsible for specific operations like fetching, saving, and updating data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic behavior&lt;/strong&gt; – The outcome, format, and response structure were known in advance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strict schemas&lt;/strong&gt; – Systems relied on predefined schemas and models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateless interactions&lt;/strong&gt; – Each request was self-contained and independent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule-based business logic&lt;/strong&gt; – The backbone of backend systems, translating requirements into deterministic "if–then" decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;"LLM workloads don't break REST. They break the runtime assumptions your backend was built on."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Coming to 2026 – LLM Era:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Applications are no longer asking for predictable responses. When a user asks:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Analyse these 4 PDFs, compare insights, and tell me the risks."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The execution path is decided at runtime by a reasoning engine. The operation takes 20–30 seconds and may invoke a dozen tools along the way. The result is non-deterministic: run it twice, get two different outputs.&lt;/p&gt;

&lt;p&gt;This isn't REST evolving. The protocol is the same. What's changed is the &lt;strong&gt;runtime behind the endpoint&lt;/strong&gt; — and that runtime now needs to handle things that traditional backends were never designed for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning engines&lt;/strong&gt; that interpret intent rather than match routes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful workflows&lt;/strong&gt; that span multiple steps, tools, and model calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-deterministic outputs&lt;/strong&gt; that can't be regression-tested the same way&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent coordinators&lt;/strong&gt; – Orchestrate multiple specialized agents to complete complex tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; that persists context across requests and sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The shift is not about adopting new protocols. It's about recognising that the contract your endpoint exposes stays simple — while the system behind it becomes fundamentally more complex. This article breaks down what that runtime looks like, what it costs, and where it fails.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;What Traditional REST Assumed&lt;/strong&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Assumption&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Reality with LLM Workloads&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fixed response schema&lt;/td&gt;
&lt;td&gt;Generative, variable output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stateless per request&lt;/td&gt;
&lt;td&gt;Multi-step, session-aware execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deterministic logic&lt;/td&gt;
&lt;td&gt;Probabilistic reasoning engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Millisecond latency&lt;/td&gt;
&lt;td&gt;10–30s per complex request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rule-based routing&lt;/td&gt;
&lt;td&gt;Intent-driven dynamic task planning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Predictable cost&lt;/td&gt;
&lt;td&gt;Variable — $0.01 to $1.00+ per request&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The 3-Layer Architecture of LLM-Native APIs&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. The Orchestration Layer&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;(Reasoning + Tools + Workflow)&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The orchestration layer in LLM-based REST APIs acts as the central control plane that transforms high-level user intent into coordinated, executable workflows. Unlike traditional backends, where requests map directly to a single service or endpoint, the orchestration layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extracts intent&lt;/strong&gt; — interprets what the user wants, not just what they typed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plans execution&lt;/strong&gt; — builds a task graph dynamically based on context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routes and coordinates&lt;/strong&gt; — dispatches to retrieval systems, tools, and external services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manages state&lt;/strong&gt; — maintains context across steps, handles retry, and feeds intermediate outputs into subsequent stages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what separates an LLM-native backend from simply wrapping a model call in a FastAPI route.&lt;/p&gt;
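&lt;p&gt;That separation can be sketched in a few lines. Below is a minimal, framework-free plan-then-execute loop; the step names and the hardcoded "plan" are illustrative stand-ins for what the reasoning engine would produce at runtime:&lt;/p&gt;

```python
# Minimal plan-then-execute orchestration sketch (framework-free).
# The step names and the hardcoded plan are illustrative, not a real LLM call.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]            # takes shared context, returns updates
    needs: list = field(default_factory=list)

def execute(plan: list[Step]) -> dict:
    """Run steps in order, threading shared context between them."""
    context: dict = {}
    done = set()
    for step in plan:
        missing = [d for d in step.needs if d not in done]
        if missing:
            raise RuntimeError(f"{step.name} blocked on {missing}")
        context.update(step.run(context))
        done.add(step.name)
    return context

# A toy plan mirroring the article's workflow: ingest -> summarize -> compare
plan = [
    Step("ingest",    lambda ctx: {"docs": ["doc-a", "doc-b"]}),
    Step("summarize", lambda ctx: {"summaries": [f"summary of {d}" for d in ctx["docs"]]},
         needs=["ingest"]),
    Step("compare",   lambda ctx: {"risks": f"compared {len(ctx['summaries'])} summaries"},
         needs=["summarize"]),
]
result = execute(plan)
```

&lt;p&gt;A real orchestration layer adds retries, timeouts, and dynamic re-planning around this same skeleton; the point is that the plan is data built at runtime, not routes fixed at deploy time.&lt;/p&gt;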

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Scenario&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Stack&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;When to pick it&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple agent, workflows &amp;lt; 30s&lt;/td&gt;
&lt;td&gt;FastAPI + LangGraph + pgvector + Celery&lt;/td&gt;
&lt;td&gt;Early stage, Postgres already in use, &amp;lt; 1M vectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-running durable workflows &amp;gt; 5 min&lt;/td&gt;
&lt;td&gt;FastAPI + Temporal + Pinecone + LangGraph&lt;/td&gt;
&lt;td&gt;Workflows must survive crashes; partial state has value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost-sensitive, high Postgres investment&lt;/td&gt;
&lt;td&gt;FastAPI + pgvector + Pydantic-AI + Inngest&lt;/td&gt;
&lt;td&gt;Avoiding infra sprawl; &amp;lt; 5M vectors; moderate QPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maximum control, latency-critical&lt;/td&gt;
&lt;td&gt;FastAPI + raw asyncio + Qdrant + custom retry&lt;/td&gt;
&lt;td&gt;P95 &amp;lt; 100ms target; team willing to own retry/backoff logic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The MCP (Model Context Protocol) Tools:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Advanced API capabilities are exposed as &lt;strong&gt;MCP tools&lt;/strong&gt;, which are created and invoked to get the required data from external tools/data sources like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Databases, Data warehouses, Vector databases&lt;/li&gt;
&lt;li&gt;File storage and document systems&lt;/li&gt;
&lt;li&gt;Monitoring and analytics tools&lt;/li&gt;
&lt;li&gt;Internal microservices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MCP introduces a &lt;strong&gt;schema-driven interface&lt;/strong&gt; where tools are discoverable and callable by the model. MCP enables a declarative approach where tools are exposed as first-class, machine-readable entities, allowing LLMs to reason about when and how to use them.&lt;/p&gt;

&lt;p&gt;In traditional API architectures, orchestration logic resides entirely within backend services, with developers explicitly defining control flow and integrations. MCP fundamentally changes this paradigm by elevating the LLM into an active participant in system execution and decision-making. MCP also introduces a layer of governance and safety into LLM-driven systems: enforcing schemas, input validation, and access controls at the tool level ensures that model actions remain predictable and auditable.&lt;/p&gt;
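&lt;p&gt;A toy version of that schema-driven interface is sketched below. The tool name and the simplified dict-based schema are illustrative; real MCP servers describe tools with JSON Schema over the protocol:&lt;/p&gt;

```python
# Sketch of a schema-driven tool registry in the spirit of MCP.
# Tool names and the dict-based schema shape are illustrative; real MCP
# servers expose JSON Schema tool definitions over the protocol.
TOOLS = {}

def register_tool(name, schema, fn):
    TOOLS[name] = {"schema": schema, "fn": fn}

def call_tool(name, args):
    """Validate arguments against the declared schema before invoking."""
    tool = TOOLS[name]
    for param, expected_type in tool["schema"].items():
        if param not in args:
            raise ValueError(f"missing required param: {param}")
        if not isinstance(args[param], expected_type):
            raise TypeError(f"{param} must be {expected_type.__name__}")
    return tool["fn"](**args)

# The model discovers tools from their declared schemas, then issues
# structured calls the runtime can validate and audit.
register_tool(
    "query_orders",
    {"customer_id": str, "limit": int},
    lambda customer_id, limit: [f"order-{i}" for i in range(limit)],
)
orders = call_tool("query_orders", {"customer_id": "c-42", "limit": 2})
```

&lt;p&gt;The governance lives in &lt;code&gt;call_tool&lt;/code&gt;: every model-initiated action passes through one validated, auditable chokepoint.&lt;/p&gt;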

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79r4ye5k5rkd2qid2i76.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79r4ye5k5rkd2qid2i76.png" alt=" " width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. The Memory Layer&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;(Short-Term + Long-Term + Semantic Memory)&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Memory solves one problem: &lt;strong&gt;context doesn't survive across steps or sessions by default.&lt;/strong&gt; Without it, every request starts blind — no knowledge of prior interactions, no intermediate state, no retrieved domain knowledge. But not everything worth computing is worth storing. Storing too much degrades retrieval quality: the more noise in your vector store, the more confidently wrong results you get back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;What should we store?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document embeddings + chunk metadata&lt;/li&gt;
&lt;li&gt;Final summarised outputs&lt;/li&gt;
&lt;li&gt;Session context (within TTL)&lt;/li&gt;
&lt;li&gt;User-level preferences&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Memory Types and Their Limits&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Short-term memory&lt;/strong&gt; — Session-level context held in-memory or fast cache (Redis).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expires with the session&lt;/li&gt;
&lt;li&gt;Safe to use freely; cost is low and staleness isn't a risk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Long-term memory&lt;/strong&gt; — Vector-based semantic storage (pgvector, Pinecone, LanceDB).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Survives across sessions; powers RAG retrieval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk:&lt;/strong&gt; Gets stale. A document embedded 6 months ago may no longer reflect current reality. Without TTL policies, old context poisons new queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Workflow memory&lt;/strong&gt; — Intermediate execution state across steps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enables resumption after failure or cancellation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk:&lt;/strong&gt; Partial state from a failed run can corrupt a retry if not versioned or cleared correctly&lt;/li&gt;
&lt;/ul&gt;
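&lt;p&gt;A minimal sketch of per-tier TTL enforcement. The tier names and TTL values are illustrative choices, not recommendations:&lt;/p&gt;

```python
# Toy memory store with per-tier TTLs, to make the expiry policy concrete.
# Tier names and TTL values are illustrative choices.
import time

TTL_SECONDS = {
    "short_term": 1800,            # dies with the session
    "workflow": 86400,             # intermediate state, one day
    "long_term": 180 * 86400,      # semantic memory, roughly 6 months
}

class MemoryStore:
    def __init__(self, clock=time.time):
        self._items = {}
        self._clock = clock          # injectable for testing

    def put(self, tier, key, value):
        expires_at = self._clock() + TTL_SECONDS[tier]
        self._items[(tier, key)] = (value, expires_at)

    def get(self, tier, key):
        """Return the stored value, or None if missing or past its TTL."""
        entry = self._items.get((tier, key))
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() > expires_at:   # stale: treat as absent
            del self._items[(tier, key)]
            return None
        return value

# A simulated clock shows stale context disappearing instead of poisoning queries
now = [0.0]
store = MemoryStore(clock=lambda: now[0])
store.put("short_term", "session:1", {"topic": "risk review"})
fresh = store.get("short_term", "session:1")     # still within TTL
now[0] = TTL_SECONDS["short_term"] + 1.0         # advance past expiry
expired = store.get("short_term", "session:1")   # gone
```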

&lt;h4&gt;
  
  
  &lt;strong&gt;Where Memory Fails&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Vector stores are lossy.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding-based retrieval doesn't return the &lt;em&gt;correct&lt;/em&gt; chunk — it returns the &lt;em&gt;most similar&lt;/em&gt; chunk. On ambiguous or underspecified queries, that's often the wrong one. The model then reasons confidently on bad input. The output looks plausible. It isn't.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Embeddings drift across model versions.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you upgrade your embedding model, every stored vector becomes semantically misaligned with new queries. Searches degrade silently — no errors, just worse results.&lt;/li&gt;
&lt;li&gt;Always version-stamp embeddings and plan for periodic re-indexing when upgrading models to avoid these issues.&lt;/li&gt;
&lt;/ul&gt;
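&lt;p&gt;One way to guard against this, sketched with placeholder model names: stamp every stored vector with the embedding model that produced it, and filter at query time so mismatched vector spaces never compete:&lt;/p&gt;

```python
# Version-stamp every embedding so a model upgrade cannot silently mix
# incompatible vector spaces. Model names here are placeholders.
EMBED_MODEL = "embed-v2"              # the model currently serving queries

def store_record(index, doc_id, vector):
    index.append({"doc_id": doc_id, "vector": vector, "model": EMBED_MODEL})

def search(index, query_vector):
    """Only score vectors produced by the current embedding model."""
    candidates = [r for r in index if r["model"] == EMBED_MODEL]
    # real similarity scoring over `candidates` would go here;
    # the excluded records are exactly the ones to queue for re-indexing
    return candidates

index = []
index.append({"doc_id": "old-doc", "vector": [0.1], "model": "embed-v1"})  # legacy
store_record(index, "new-doc", [0.2])
hits = search(index, [0.2])           # the legacy vector is excluded, not mismatched
```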

&lt;p&gt;&lt;strong&gt;Stale memory hurts reasoning.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A chunk retrieved from a session 3 months ago may contradict the current document set. Without TTL policies per memory type, the system treats outdated context as ground truth. Define explicit expiry for each memory tier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Retrieval confidence is not retrieval accuracy.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model has no way to know that a retrieved chunk is wrong — it treats retrieved content as authoritative. There is no built-in scepticism. This means garbage in, confident garbage out. Never treat retrieved chunks as ground truth — surface retrieval confidence in traces.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. The Interaction Layer&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;(API Gateway + Protocols)&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The interaction layer in LLM-based REST APIs serves as the primary touchpoint between clients and the underlying intelligence of the system, translating human intent into structured requests and delivering responses in a consumable form.&lt;/p&gt;

&lt;p&gt;Unlike traditional APIs that expose rigid, operation-specific endpoints, the interaction layer is designed around intent-driven communication, where a single endpoint can handle a wide range of tasks expressed in natural language. It is responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request validation&lt;/li&gt;
&lt;li&gt;Authentication&lt;/li&gt;
&lt;li&gt;Context injection&lt;/li&gt;
&lt;li&gt;Input transformation (e.g., Pydantic schemas)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the response side, it standardizes outputs—whether textual insights, structured data, or progressive updates (streams of data).&lt;/p&gt;

&lt;p&gt;Typical endpoint patterns include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chat-style endpoints&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;/chat – conversational&lt;/li&gt;
&lt;li&gt;/agent – tool-driven workflow executor&lt;/li&gt;
&lt;li&gt;/reason – produce structured reasoning&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Function-calling endpoints&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;/function-call – structured tool calls&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;In certain cases, the interaction layer can leverage Server-Sent Events (SSE) to provide a streaming interface for real-time feedback. For long-running or multi-step tasks, SSE enables the server to push incremental updates directly to the client over a single HTTP connection, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing status&lt;/li&gt;
&lt;li&gt;Partial summaries&lt;/li&gt;
&lt;li&gt;Evolving insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This significantly improves user experience by reducing latency and increasing transparency into system behavior.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming responses&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;/stream – stream tokens&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;However, SSE is used strictly as a delivery mechanism within the interaction layer and does not replace the underlying asynchronous execution systems. It allows LLM-based APIs to feel responsive and interactive while still relying on robust orchestration and processing layers behind the scenes.&lt;/p&gt;
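&lt;p&gt;Concretely, SSE is just a text framing over one long-lived HTTP response: each event is one or more &lt;code&gt;data:&lt;/code&gt; lines terminated by a blank line. A small sketch of formatting progress events (the payload fields are illustrative):&lt;/p&gt;

```python
# SSE framing sketch: a real server would write each frame to the open
# HTTP response stream; payload field names here are illustrative.
import json

def sse_event(payload, event=None):
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(payload)}")
    return "\n".join(lines) + "\n\n"      # the blank line terminates the event

frame = sse_event({"step": "summarize", "done": 2, "total": 4}, event="progress")
```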

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmafc61l409s343195cj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmafc61l409s343195cj.png" alt=" " width="800" height="617"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;End-to-End LLM Request Lifecycle&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;User asks: &lt;strong&gt;&lt;em&gt;"Analyse these 4 PDFs, compare insights, and tell me the risks."&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 1 — Request Ingestion &lt;em&gt;(Interaction Layer)&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Validate input schema, auth, document URLs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail fast:&lt;/strong&gt; Return 422 before any LLM call if validation fails — saves cost&lt;/li&gt;
&lt;/ul&gt;
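&lt;p&gt;A stdlib-only sketch of that fail-fast check. A production service might express this as a Pydantic model; the field names here are assumptions about the request shape:&lt;/p&gt;

```python
# Fail fast before any LLM spend: stdlib-only validation sketch.
# Field names are assumed; a real service might use Pydantic for this.
from urllib.parse import urlparse

def validate_request(body):
    """Return a list of 422-style errors; an empty list means proceed."""
    errors = []
    urls = body.get("document_urls")
    if not isinstance(urls, list) or len(urls) == 0:
        errors.append("document_urls must be a non-empty list")
        return errors
    for url in urls:
        if urlparse(url).scheme not in ("http", "https"):
            errors.append(f"unsupported URL scheme: {url}")
    return errors

bad = validate_request({"document_urls": ["ftp://host/report.pdf"]})
ok = validate_request({"document_urls": ["https://host/report.pdf"]})
```

&lt;p&gt;If &lt;code&gt;validate_request&lt;/code&gt; returns errors, respond 422 immediately — no model is invoked, no cost is incurred.&lt;/p&gt;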

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 2 — Interaction Mode Setup &lt;em&gt;(Interaction Layer)&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Decide: sync response or SSE streaming&lt;/li&gt;
&lt;li&gt;Issue a &lt;strong&gt;job ID&lt;/strong&gt; immediately — acts as resumption token if SSE connection drops&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 3 — Intent Parsing &amp;amp; Task Decomposition &lt;em&gt;(Orchestration Layer)&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;LLM breaks prompt into task graph: Ingest → Extract → Summarize → Compare → Risk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guard:&lt;/strong&gt; If parsed plan looks incomplete or ambiguous, surface a clarification prompt — don't proceed into an expensive workflow on a flawed plan&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 4 — Document Ingestion &lt;em&gt;(Memory Layer)&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Scenario&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Recovery&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scanned PDF (no text layer)&lt;/td&gt;
&lt;td&gt;Trigger OCR fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Password-protected&lt;/td&gt;
&lt;td&gt;Flag, skip, notify user&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Corrupted / unreachable&lt;/td&gt;
&lt;td&gt;Retry × 3 with backoff, then skip&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rule:&lt;/strong&gt; One bad document should never abort the entire workflow. Continue with remaining documents.&lt;/p&gt;
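&lt;p&gt;The retry-then-skip rule can be sketched as follows; the &lt;code&gt;fetch&lt;/code&gt; callable is a stand-in for a real document loader:&lt;/p&gt;

```python
# "Retry x 3 with backoff, then skip": one failing document must not
# abort the batch. The fetch callable is a stand-in for a real loader.
import time

def fetch_with_retry(fetch, doc_url, attempts=3, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return fetch(doc_url)
        except IOError:
            if attempt == attempts - 1:
                return None                          # give up on this doc only
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

def ingest_batch(fetch, doc_urls):
    results, skipped = {}, []
    for url in doc_urls:
        doc = fetch_with_retry(fetch, url)
        if doc is None:
            skipped.append(url)                      # flag, skip, notify later
        else:
            results[url] = doc
    return results, skipped

def flaky_fetch(url):
    if url == "bad.pdf":
        raise IOError("unreachable")
    return f"contents of {url}"

docs, skipped = ingest_batch(flaky_fetch, ["a.pdf", "bad.pdf", "b.pdf"])
```

&lt;p&gt;The batch proceeds with two documents and reports &lt;code&gt;bad.pdf&lt;/code&gt; as skipped instead of failing the whole workflow.&lt;/p&gt;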

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 5 — Text Extraction &amp;amp; Chunking &lt;em&gt;(Memory Layer)&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Extract text; split into chunks&lt;/li&gt;
&lt;li&gt;Filter low-confidence OCR output — don't embed junk&lt;/li&gt;
&lt;li&gt;Chunk size must be calibrated against model context window limits&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 6 — Embedding &amp;amp; Vector Storage &lt;em&gt;(Memory Layer)&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Convert chunks → embeddings → vector DB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limits:&lt;/strong&gt; Retry with exponential backoff, not hard failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version drift:&lt;/strong&gt; Embeddings are model-version specific — version-stamp everything; plan for re-indexing on model upgrades&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 7 — Parallel Document Processing &lt;em&gt;(Orchestration Layer)&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Summarize all 4 documents concurrently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial failure:&lt;/strong&gt; If 3 of 4 succeed, proceed — don't abort for one timeout&lt;/li&gt;
&lt;li&gt;Set &lt;strong&gt;per-document timeouts&lt;/strong&gt;, not a single global one&lt;/li&gt;
&lt;/ul&gt;
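&lt;p&gt;A sketch of this pattern with asyncio, using sleeps to simulate model latency; &lt;code&gt;summarize&lt;/code&gt; is a stand-in for a real LLM call:&lt;/p&gt;

```python
# Per-document timeouts with partial success, using asyncio.
# summarize() is a stand-in for an LLM call; delays simulate model latency.
import asyncio

async def summarize(doc, delay):
    await asyncio.sleep(delay)
    return f"summary of {doc}"

async def process_all(doc_delays, per_doc_timeout):
    async def guarded(doc, delay):
        try:
            return await asyncio.wait_for(summarize(doc, delay), per_doc_timeout)
        except asyncio.TimeoutError:
            return None                   # this doc failed; the others continue
    results = await asyncio.gather(*(guarded(d, dl) for d, dl in doc_delays))
    ok = [r for r in results if r is not None]
    if not ok:
        raise RuntimeError("all documents failed")
    return ok                             # proceed with partial results

# d4 simulates a timeout; the other three succeed within budget
docs = [("d1", 0.0), ("d2", 0.0), ("d3", 0.0), ("d4", 1.0)]
summaries = asyncio.run(process_all(docs, per_doc_timeout=0.05))
```

&lt;p&gt;Each document gets its own timeout budget, so one slow summarization cannot sink the batch.&lt;/p&gt;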

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 8 — Cross-Document Reasoning &lt;em&gt;(Orchestration Layer)&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Compare summaries; identify overlaps, conflicts; generate risks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context overflow:&lt;/strong&gt; Combined summaries may exceed context window — use map-reduce (reason over pairs, then synthesize). Never silently truncate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning loops:&lt;/strong&gt; Cap tool invocations (e.g. max 20 steps) with a hard circuit breaker&lt;/li&gt;
&lt;/ul&gt;
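&lt;p&gt;The step-budget circuit breaker can be sketched as a hard loop cap; the simulated agent actions below are illustrative:&lt;/p&gt;

```python
# Hard cap on reasoning/tool steps: the circuit breaker described above.
MAX_STEPS = 20

class StepBudgetExceeded(RuntimeError):
    pass

def run_agent_loop(next_action, max_steps=MAX_STEPS):
    """next_action(state) returns ("done", result) or ("continue", new_state)."""
    state = None
    for _ in range(max_steps):
        status, payload = next_action(state)
        if status == "done":
            return payload
        state = payload
    raise StepBudgetExceeded(f"exceeded {max_steps} steps; aborting workflow")

# A well-behaved loop terminates within budget...
countdown = iter(range(3))
def act(state):
    try:
        next(countdown)
        return ("continue", state)
    except StopIteration:
        return ("done", "final answer")

answer = run_agent_loop(act)

# ...while a runaway loop trips the breaker instead of burning tokens forever.
try:
    run_agent_loop(lambda state: ("continue", state), max_steps=5)
    outcome = "completed"
except StepBudgetExceeded as exc:
    outcome = str(exc)
```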

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 9 — Response Aggregation &lt;em&gt;(Orchestration Layer)&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Combine insights + comparisons + risks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial failure:&lt;/strong&gt; If one component (e.g. risk analysis) fails, return what succeeded with clear metadata — never return a generic error&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 10 — Cancellation &amp;amp; Timeout Handling &lt;em&gt;(Orchestration + Interaction Layers)&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Propagate cancellation signal down to async tasks when user aborts&lt;/li&gt;
&lt;li&gt;Persist any intermediate results produced so far&lt;/li&gt;
&lt;li&gt;Without this: backend keeps running, burning LLM credits, after the user has left&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 11 — Response Delivery &lt;em&gt;(Interaction Layer)&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Stream via SSE or return full response&lt;/li&gt;
&lt;li&gt;Run basic schema validation on LLM output before delivery — especially if downstream systems consume it programmatically&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Step 12 — Memory Persistence &lt;em&gt;(Memory Layer)&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Store embeddings, summaries, final output&lt;/li&gt;
&lt;li&gt;Set &lt;strong&gt;TTL policies&lt;/strong&gt; — stale memory retrieved months later can hurt reasoning&lt;/li&gt;
&lt;li&gt;Check memory before re-running on retry — enables &lt;strong&gt;idempotency&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
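&lt;p&gt;A sketch of the idempotency check, with an illustrative cache-key derivation standing in for the real memory layer:&lt;/p&gt;

```python
# Idempotent retries: check memory for a completed result before re-running.
# The cache-key derivation and job shape are illustrative.
import hashlib
import json

completed = {}      # stands in for the persistent memory layer
llm_calls = 0       # counts how much expensive work actually ran

def job_key(user_id, doc_urls):
    raw = json.dumps({"user": user_id, "docs": sorted(doc_urls)})
    return hashlib.sha256(raw.encode()).hexdigest()

def run_job(user_id, doc_urls):
    global llm_calls
    key = job_key(user_id, doc_urls)
    if key in completed:
        return completed[key]            # retry hits memory, not the model
    llm_calls += 1                       # the expensive path
    result = {"risks": f"analysis of {len(doc_urls)} docs"}
    completed[key] = result
    return result

first = run_job("u1", ["a.pdf", "b.pdf"])
second = run_job("u1", ["b.pdf", "a.pdf"])   # same job retried, args reordered
```

&lt;p&gt;The retry returns the stored result and the expensive path runs exactly once — sorting the inputs before hashing makes the key stable under reordering.&lt;/p&gt;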

&lt;h4&gt;
  
  
  &lt;strong&gt;Failure Surface Summary&lt;/strong&gt;
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Step&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Failure Mode&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Recovery&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Request Ingestion&lt;/td&gt;
&lt;td&gt;Bad schema / unreachable URL&lt;/td&gt;
&lt;td&gt;422 before LLM call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interaction Setup&lt;/td&gt;
&lt;td&gt;SSE drops&lt;/td&gt;
&lt;td&gt;Resumption via job ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent Parsing&lt;/td&gt;
&lt;td&gt;Hallucinated / incomplete plan&lt;/td&gt;
&lt;td&gt;Confidence gate → clarify&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document Ingestion&lt;/td&gt;
&lt;td&gt;Scanned / corrupt / protected&lt;/td&gt;
&lt;td&gt;Per-doc fallback; partial proceed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extraction&lt;/td&gt;
&lt;td&gt;OCR noise / garbled text&lt;/td&gt;
&lt;td&gt;Quality filter; tag low confidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding&lt;/td&gt;
&lt;td&gt;Rate limit / model drift&lt;/td&gt;
&lt;td&gt;Backoff retries; version-stamp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel Processing&lt;/td&gt;
&lt;td&gt;Partial LLM timeout&lt;/td&gt;
&lt;td&gt;Min success threshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;Context overflow / loops&lt;/td&gt;
&lt;td&gt;Map-reduce; step budget cap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregation&lt;/td&gt;
&lt;td&gt;Component failure&lt;/td&gt;
&lt;td&gt;Partial result with metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cancellation&lt;/td&gt;
&lt;td&gt;Mid-workflow abort&lt;/td&gt;
&lt;td&gt;Propagate signal; persist partial state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delivery&lt;/td&gt;
&lt;td&gt;Malformed output&lt;/td&gt;
&lt;td&gt;Pre-delivery schema check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistence&lt;/td&gt;
&lt;td&gt;Stale context / duplicate run&lt;/td&gt;
&lt;td&gt;TTL policy; idempotency check&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Design principle:&lt;/strong&gt; Partial success with honest metadata beats a hard failure every time. Build for the broken path — the happy path takes care of itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With LLMs in the picture, APIs are no longer just interfaces—they're becoming part of systems that can interpret intent, reason through tasks, and coordinate execution dynamically.&lt;/p&gt;

&lt;p&gt;At its core, this article highlights a shift in how we design backends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From &lt;strong&gt;deterministic endpoints → intent-driven systems&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;From &lt;strong&gt;static workflows → dynamic orchestration&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;From &lt;strong&gt;stateless APIs → memory-aware architectures&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;From &lt;strong&gt;hardcoded logic → model-assisted decision making&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;"REST isn't evolving. The runtime behind your endpoint is being replaced"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most will feel this shift not as a clean architectural migration, but as accumulated pressure: timeouts that don't make sense, costs that don't map to load, failures that don't reproduce.&lt;/p&gt;

&lt;p&gt;The harder question is: &lt;strong&gt;does your current backend infrastructure support what you're asking it to do?&lt;/strong&gt; Not the endpoint. Not the framework. The runtime — the orchestration, the memory, the failure recovery, the cost model.&lt;/p&gt;

&lt;p&gt;If the answer is uncertain, that uncertainty is the signal. Start there.&lt;/p&gt;

</description>
      <category>python</category>
      <category>fastapi</category>
      <category>ai</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
