<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Portkey</title>
    <description>The latest articles on Forem by Portkey (@portkey).</description>
    <link>https://forem.com/portkey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F8056%2F39f656e2-eff4-4755-b84b-44d61d7c98b5.jpg</url>
      <title>Forem: Portkey</title>
      <link>https://forem.com/portkey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/portkey"/>
    <language>en</language>
    <item>
      <title>Claude Code agents: what they are, how they work, and how to scale them</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Mon, 09 Mar 2026 14:57:31 +0000</pubDate>
      <link>https://forem.com/portkey/claude-code-agents-what-they-are-how-they-work-and-how-to-scale-them-15n0</link>
      <guid>https://forem.com/portkey/claude-code-agents-what-they-are-how-they-work-and-how-to-scale-them-15n0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck3t9iaujrlzd1ybm3ew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck3t9iaujrlzd1ybm3ew.png" alt="Claude Code agents: what they are, how they work, and how to scale them"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Code is now the most widely used AI coding agent. We all know what it does. The harder question is what happens when you roll it out across a team of 20, 50, or 200 engineers, each running agentic loops that spawn subagents, call MCP tools, and burn through tokens autonomously.&lt;/p&gt;

&lt;p&gt;This post covers the agent architecture inside Claude Code, what breaks when you take it to production, and how to add the governance and observability layer that enterprises need.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Claude Code agents work under the hood
&lt;/h2&gt;

&lt;p&gt;Claude Code runs a straightforward agent loop: the model produces a message, and if it includes a tool call, the tool executes and results feed back into the model. No tool call means the loop stops and the agent waits for input. It ships with around 14 tools spanning file operations, shell commands, web access, and control flow.&lt;/p&gt;
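&lt;p&gt;That loop fits in a few lines of code. A minimal sketch in Python — the model stub and tool table here are illustrative stand-ins, not Claude Code's actual internals:&lt;/p&gt;

```python
# Minimal agent loop: call the model, execute any tool call it returns,
# feed the result back, and stop when the model answers without a tool call.
# `call_model` and `TOOLS` are illustrative stubs, not Claude Code internals.

def call_model(messages):
    # Stub: a real implementation would call the Anthropic Messages API.
    # Here we request one tool call, then finish on the next turn.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "read_file", "args": {"path": "README.md"}}
    return {"content": "done"}

TOOLS = {"read_file": lambda args: f"contents of {args['path']}"}

def agent_loop(user_prompt):
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        reply = call_model(messages)
        tool_name = reply.get("tool")
        if tool_name is None:
            return reply["content"]  # no tool call: the loop stops
        result = TOOLS[tool_name](reply["args"])
        messages.append({"role": "tool", "content": result})

print(agent_loop("Summarize the README"))  # prints "done"
```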

&lt;p&gt;What makes the system powerful is the layered agent architecture sitting on top of that loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subagents
&lt;/h3&gt;

&lt;p&gt;Claude Code delegates to specialized subagents that each run in their own context window. The built-in ones include Explore (read-only codebase search), Plan (research for planning mode), and a general-purpose agent for complex multi-step tasks. Each subagent has restricted tool access and independent permissions.&lt;/p&gt;

&lt;p&gt;The real value here is context management. Agentic tasks fill up the context window fast. Subagents keep exploration and implementation out of the main conversation, preserving the primary context for decision-making.&lt;/p&gt;

&lt;p&gt;You can also define custom subagents as markdown files with YAML frontmatter, scoped to specific tools, models, and system prompts. A read-only code reviewer, a documentation generator, a security auditor, each with exactly the permissions it needs.&lt;/p&gt;
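&lt;p&gt;As a sketch, a read-only code reviewer could be defined like this. The frontmatter fields follow Claude Code's documented subagent format; the name, tool list, and prompt are illustrative:&lt;/p&gt;

```markdown
---
name: code-reviewer
description: Read-only reviewer. Use after code changes to flag bugs and style issues.
tools: Read, Grep, Glob
---

You are a careful code reviewer. Inspect the changed files, flag likely bugs
and risky patterns, and suggest fixes. Never modify files yourself.
```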

&lt;h3&gt;
  
  
  Agent teams
&lt;/h3&gt;

&lt;p&gt;Subagents work within a single session. Agent teams coordinate across separate sessions, enabling parallel workflows. Think: one agent building a backend API while another builds the frontend, each in an isolated Git worktree.&lt;/p&gt;

&lt;p&gt;Anthropic's own 2026 trends report names multi-agent coordination as a top priority for engineering teams this year.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skills, hooks, and the Agent SDK
&lt;/h3&gt;

&lt;p&gt;Skills are reusable instruction packages that Claude loads automatically when it encounters a matching task. Hooks trigger actions at specific workflow points (run tests after changes, lint before commits). The Claude Agent SDK gives teams the same underlying tools and permissions framework to build custom agent experiences outside the terminal entirely.&lt;/p&gt;
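&lt;p&gt;For example, a hook that runs the test suite after every file edit can be declared in &lt;em&gt;settings.json&lt;/em&gt;. The matcher and command below are illustrative; consult Claude Code's hooks reference for the exact schema:&lt;/p&gt;

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "npm test" }]
      }
    ]
  }
}
```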

&lt;h2&gt;
  
  
  What breaks at scale
&lt;/h2&gt;

&lt;p&gt;None of these capabilities ship with the operational scaffolding that production environments demand. Here is where teams consistently run into trouble.&lt;/p&gt;

&lt;h3&gt;
  
  
  Runaway costs with no attribution
&lt;/h3&gt;

&lt;p&gt;A single complex task can spawn multiple subagents, each making dozens of tool calls across extended agentic loops. Multiply that by a team of engineers, and token spend becomes unpredictable. &lt;a href="https://portkey.ai/for/claude-code?ref=portkey.ai" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; has no native cost tracking by user, team, or project. The first monthly bill is usually the wake-up call.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zero observability
&lt;/h3&gt;

&lt;p&gt;There is no built-in logging, tracing, or audit trail. You cannot see what prompts were run, which subagents were spawned, how long requests took, or where errors occurred. For debugging, optimization, and compliance, this is a non-starter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single-provider fragility
&lt;/h3&gt;

&lt;p&gt;Claude Code ties you to one provider. If that provider hits rate limits, has an outage, or is unavailable in a region you need, your workflows stop. Switching providers requires manual reconfiguration.&lt;/p&gt;

&lt;h3&gt;
  
  
  No access control
&lt;/h3&gt;

&lt;p&gt;There is no centralized way to manage API keys, restrict access by role or team, enforce rate limits, or set budget caps. Every developer's setup is independent, which makes governance at scale nearly impossible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ungoverned MCP tool access
&lt;/h3&gt;

&lt;p&gt;Claude Code supports MCP for connecting to external tools like GitHub, Slack, databases, and internal APIs. When agents can take real actions through these tools, the question is not whether MCP works but whether it should be allowed, for which agent, with what permissions, and who is watching. Without a governance layer, MCP adoption stalls at experimentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding governance with an AI gateway
&lt;/h2&gt;

&lt;p&gt;An AI gateway sits between Claude Code and the model providers, adding observability, access control, and reliability without changing how developers use the tool. Portkey is built for exactly this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Team-level governance
&lt;/h3&gt;

&lt;p&gt;Isolate teams with separate workspaces, each with its own budget, rate limits, and access controls. Manage API keys centrally instead of distributing raw keys to individual developers. Enforce org-wide policies without building custom wrappers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-provider routing
&lt;/h3&gt;

&lt;p&gt;Route Claude Code requests across Anthropic, Bedrock, and Vertex AI through a single endpoint. Define fallback logic, add load balancing, or set metadata-based conditions for smarter routing. Developers change nothing in their workflow. The gateway handles it.&lt;/p&gt;

&lt;p&gt;Just add this to &lt;em&gt;settings.json&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_BASE_URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.portkey.ai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_AUTH_TOKEN"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YOUR_PORTKEY_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_CUSTOM_HEADERS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"x-portkey-api-key: YOUR_PORTKEY_API_KEY&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;x-portkey-provider: @anthropic-prod"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
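&lt;p&gt;The fallback and load-balancing rules themselves live in a Portkey config rather than in Claude Code. As a sketch, a config that falls back from a primary Anthropic integration to Bedrock might look like this — the provider slugs are placeholders, and the exact fields should be checked against Portkey's config reference:&lt;/p&gt;

```json
{
  "strategy": { "mode": "fallback" },
  "targets": [
    { "provider": "@anthropic-prod" },
    { "provider": "@bedrock-claude" }
  ]
}
```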



&lt;h3&gt;
  
  
  Cost tracking and budget controls
&lt;/h3&gt;

&lt;p&gt;Every request is logged with token usage and cost, broken down by provider, team, user, and project. Set hard &lt;a href="https://portkey.ai/docs/product/enterprise-offering/budget-policies?ref=portkey.ai" rel="noopener noreferrer"&gt;budget limits&lt;/a&gt; per team or developer. Get alerts when spend crosses thresholds. Route to cheaper providers when cost matters more than latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl90ib0wry3y4dyfrww7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl90ib0wry3y4dyfrww7.png" alt="Claude Code agents: what they are, how they work, and how to scale them"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For teams running parallel agent fleets, this is the difference between controlled scaling and a surprise invoice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Full observability
&lt;/h3&gt;

&lt;p&gt;Portkey's OTEL-compliant &lt;a href="https://portkey.ai/features/observability?ref=portkey.ai" rel="noopener noreferrer"&gt;observability&lt;/a&gt; captures every request with metadata: latency, tokens, error rates, provider, route, and custom tags. Filter and search across all Claude Code usage in one dashboard. Debug failed agent loops, monitor MCP tool calls, and track performance trends over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dkz9gyb53vkumszplvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dkz9gyb53vkumszplvn.png" alt="Claude Code agents: what they are, how they work, and how to scale them"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Guardrails
&lt;/h3&gt;

&lt;p&gt;Enforce PII detection, content filtering, prompt injection protection, and token limits before requests reach the model. This matters especially for agentic workflows processing sensitive codebases or interacting with production systems through MCP.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP Gateway
&lt;/h3&gt;

&lt;p&gt;Portkey's MCP Gateway provides centralized authentication, role-based access control, and full observability for MCP tool connections. Agents authenticate once through Portkey. The gateway handles credential injection, permission checks, and request logging for every configured MCP server.&lt;/p&gt;

&lt;p&gt;Platform teams control exactly which teams can access which MCP servers and tools. Every tool call is logged with full context: who accessed what, when, and with what parameters. This is the layer that makes MCP safe for production use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting it together
&lt;/h3&gt;

&lt;p&gt;The shift happening right now is clear. Claude Code agents are moving from individual developer tools to team-level infrastructure. Subagents, agent teams, skills, MCP integrations, and orchestration layers are making agents dramatically more capable. But capability without governance is a liability.&lt;/p&gt;

&lt;p&gt;The teams scaling Claude Code successfully are treating it as an infrastructure problem: routing through a gateway, tracking every request, setting budget boundaries, and governing tool access centrally. The developers keep typing &lt;code&gt;claude&lt;/code&gt; in their terminal. The platform team gets the visibility and control they need.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;To run Claude Code agents with governance at scale,&lt;/em&gt; &lt;a href="https://app.portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;em&gt;get started with Portkey&lt;/em&gt;&lt;/a&gt; &lt;em&gt;or&lt;/em&gt; &lt;a href="https://portkey.sh/blog?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;em&gt;book a demo&lt;/em&gt;&lt;/a&gt; &lt;em&gt;for a walkthrough.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>LLM Deployment Pipeline Explained Step by Step</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Fri, 27 Feb 2026 11:41:00 +0000</pubDate>
      <link>https://forem.com/portkey/llm-deployment-pipeline-explained-step-by-step-6g6</link>
      <guid>https://forem.com/portkey/llm-deployment-pipeline-explained-step-by-step-6g6</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4buzfcfhmlp6clv89t0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4buzfcfhmlp6clv89t0.jpg" alt="LLM Deployment Pipeline Explained Step by Step" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LLM deployment is the process of taking a trained language model and converting it into a production service that can handle live user requests reliably and at scale. &lt;/p&gt;

&lt;p&gt;In practice, &lt;strong&gt;a truly production-ready LLM system is shaped by five interconnected layers: containerization, infrastructure and GPU allocation, the API and serving layer, autoscaling, and monitoring&lt;/strong&gt;. These layers keep performance stable, costs predictable, and outputs trustworthy as real traffic flows in.&lt;/p&gt;

&lt;p&gt;💡 Gartner estimates that &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2023-10-11-gartner-says-more-than-80-percent-of-enterprises-will-have-used-generative-ai-apis-or-deployed-generative-ai-enabled-applications-by-2026?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;more than 80% of enterprises will have generative AI applications in production by 2026&lt;/u&gt;&lt;/a&gt;, yet most online tutorials still stop at “get it running” and ignore what happens after.&lt;/p&gt;

&lt;p&gt;Don’t worry, though, because this article covers the full lifecycle teams struggle with post-launch, including scaling predictably under real demand, monitoring probabilistic outputs that can fail silently, and controlling costs that compound with every request.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Cloud APIs, self-hosted, or on-premises&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Every LLM deployment begins with a foundational architectural decision: consume models via cloud APIs, run them yourself on cloud GPUs, or operate fully on-premises. The right choice depends on how you balance speed, control, cost, and compliance.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Architecture path&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;What it offers&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Tradeoffs&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Typical use case&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;Cloud APIs (&lt;a href="https://openai.com/?ref=portkey.ai" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;, &lt;a href="https://www.anthropic.com/?ref=portkey.ai" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;, &lt;a href="https://azure.microsoft.com/en-us?ref=portkey.ai" rel="noopener noreferrer"&gt;Microsoft Azure&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Fastest route to production, no infrastructure management, elastic scaling&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Pay-per-token costs compound, vendor lock-in risk, data residency constraints&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Prototyping and early production&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;Self-hosted on cloud GPUs&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Full control over open models (&lt;a href="https://www.llama.com/?ref=portkey.ai" rel="noopener noreferrer"&gt;Llama&lt;/a&gt;, &lt;a href="https://mistral.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;Mistral&lt;/a&gt;), performance tuning&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;You own reliability, scaling, and ops complexity&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Teams needing flexibility and customization&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;On-premises&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Strong compliance posture (HIPAA, GDPR), predictable cost at high utilization&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Hardware expense, deep infrastructure expertise required&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Regulated or high-volume workloads&lt;/p&gt;

&lt;p&gt;|&lt;/p&gt;

&lt;p&gt;For self-hosting, infrastructure costs typically range from about &lt;a href="https://www.whitefiber.com/compare/best-gpus-for-llm-inference-in-2025?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;$0.75/hour&lt;/u&gt;&lt;/a&gt; for an L4 GPU up to &lt;a href="https://docs.jarvislabs.ai/blog/h100-price?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;$3.25/hour&lt;/u&gt;&lt;/a&gt; for an H100, before orchestration and redundancy. On-premises shifts that spend into upfront hardware investment that only pays off with sustained utilization.&lt;/p&gt;

&lt;p&gt;Across all three paths, data residency and security requirements often become the deciding factor.&lt;/p&gt;

&lt;p&gt;Also, keep in mind that according to &lt;a href="https://portkey.ai/llms-in-prod-25?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;Portkey’s LLMs in Prod 2025&lt;/u&gt;&lt;/a&gt;, 40% of teams now use multiple LLM providers, up from 23% ten months earlier. As a result, you have to carefully consider how your architectural choice will affect portability and long-term leverage to avoid the risk of vendor lock-in. You can also use Portkey’s AI Gateway to mitigate that risk by routing across 1,600+ models through a single API, so switching providers becomes a configuration change rather than a rewrite.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;GPU selection and infrastructure sizing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LLMs only run fast when everything fits inside GPU memory (VRAM). That includes not just the model itself, but also the temporary memory it uses to hold conversation context while generating responses, called the &lt;a href="https://research.aimultiple.com/llm-quantization/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;KV cache&lt;/u&gt;&lt;/a&gt;. If either spills into normal system memory, performance drops sharply.&lt;/p&gt;

&lt;p&gt;This is why sizing GPUs by model size alone often fails in production. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A 70B model (FP16) needs about 140GB for weights. If you add a standard 8k context window and batch headroom, total VRAM &lt;a href="https://apxml.com/posts/ultimate-system-requirements-llama-3-models?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;reaches ~161GB&lt;/u&gt;&lt;/a&gt;, requiring two 80GB GPUs (like the H100) or a single H200.&lt;/li&gt;
&lt;li&gt;An 8B model (FP16) uses roughly 16GB for weights and &lt;a href="https://apxml.com/models/llama-3-1-8b?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;~18.4GB&lt;/u&gt;&lt;/a&gt; with context overhead – an ideal fit for one 24GB L4 GPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, always budget VRAM for both the model and its live working memory.&lt;/p&gt;
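&lt;p&gt;The arithmetic behind those figures is easy to reproduce. A rough Python estimator — the layer and head counts are illustrative Llama-3-70B-style values, and a batch of 8 concurrent sequences is assumed as one way to stand in for "batch headroom":&lt;/p&gt;

```python
# Back-of-the-envelope VRAM budget: FP16 weights plus KV cache.
# FP16 stores 2 bytes per parameter; the KV cache stores keys and values
# (hence the factor of 2) for every layer, KV head, and head dimension.
# Layer/head counts below are illustrative Llama-3-70B-style numbers.

GB = 1e9  # decimal gigabytes, matching the figures above

def weights_gb(params_billions, bytes_per_param=2):
    return params_billions * 1e9 * bytes_per_param / GB

def kv_cache_gb(context_tokens, layers, kv_heads, head_dim, bytes_per_val=2):
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
    return context_tokens * per_token / GB

weights = weights_gb(70)                                     # 140.0 GB of weights
kv_per_seq = kv_cache_gb(8192, layers=80, kv_heads=8, head_dim=128)
total = weights + 8 * kv_per_seq                             # 8 concurrent sequences
print(round(weights), round(kv_per_seq, 2), round(total))    # prints: 140 2.68 161
```

&lt;p&gt;Real engines add further overhead (activations, CUDA graphs, fragmentation), so treat these numbers as a floor, not a ceiling.&lt;/p&gt;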

&lt;h2&gt;
  
  
  &lt;strong&gt;Inference frameworks and API endpoint design&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The efficiency of an LLM deployment is defined by the request lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;During prefill, the full prompt is processed in parallel and is primarily compute-bound. &lt;/li&gt;
&lt;li&gt;During decode, tokens are generated one by one and become memory-bandwidth bound. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this, you’ll need an inference framework to manage the KV cache so the GPU never sits idle between these phases.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Framework&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Key innovation&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;2026 performance edge&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Best for&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.vllm.ai/en/latest/?ref=portkey.ai" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;PagedAttention&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Industry standard; highest stability for general APIs.&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;High-concurrency production APIs.&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.google.com/search?q=https%3A%2F%2Fgithub.com%2Fsglang-project%2Fsglang&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;SGLang&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;RadixAttention&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;&lt;a href="https://research.aimultiple.com/inference-engines/?ref=portkey.ai" rel="noopener noreferrer"&gt;29% higher throughput&lt;/a&gt; than vLLM; instant prefix caching.&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Agentic and RAG workloads.&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;&lt;a href="https://nvidia.github.io/TensorRT-LLM/?ref=portkey.ai" rel="noopener noreferrer"&gt;TensorRT-LLM&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Compiled kernels&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Lowest latency (up to 2x faster TTFT on NVIDIA).&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Latency-critical systems.&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/InternLM/lmdeploy?ref=portkey.ai" rel="noopener noreferrer"&gt;LMDeploy&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;TurboMind C++ engine&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;&lt;a href="https://research.aimultiple.com/inference-engines/?ref=portkey.ai" rel="noopener noreferrer"&gt;Rivals SGLang&lt;/a&gt;; excellent 4-bit/8-bit throughput&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;High-volume batch inference.&lt;/p&gt;

&lt;p&gt;|&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The hierarchy of metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First is &lt;strong&gt;TTFT (Time to First Token)&lt;/strong&gt;, which is how quickly users see the first word appear – the &lt;strong&gt;most critical UX signal&lt;/strong&gt;. With superior cache reuse, &lt;a href="https://rawlinson.ca/articles/vllm-vs-sglang-performance-benchmark-h100?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;SGLang achieves roughly 3.7× faster TTFT at low concurrency&lt;/u&gt;&lt;/a&gt;, making it ideal for highly interactive agents.&lt;/p&gt;

&lt;p&gt;Next is &lt;strong&gt;TPS (Tokens Per Second)&lt;/strong&gt;, which defines total system capacity. While vLLM is commonly the default choice, &lt;a href="https://www.reddit.com/r/LocalLLaMA/comments/1odp8pe/sglang_vs_vllm_on_h200_which_one_do_you_prefer/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;SGLang can deliver about 33% higher TPS&lt;/u&gt;&lt;/a&gt; in multi-turn conversations by avoiding repeated context processing.&lt;/p&gt;

&lt;p&gt;Then comes &lt;strong&gt;KV cache utilization&lt;/strong&gt;, which determines stability. Modern inference engines target around &lt;a href="https://research.aimultiple.com/inference-engines/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;80% GPU memory usage&lt;/u&gt;&lt;/a&gt; to balance throughput and safety, while pushing toward 95% can increase the risk of instability and trigger crashes during graph capture and runtime spikes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;API design for production UX&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Streaming&lt;/strong&gt; is mandatory. Total generation time matters less than TTFT – users tolerate a 10-second response if the first token appears in ~200ms.&lt;/p&gt;

&lt;p&gt;Also, &lt;strong&gt;continuous batching&lt;/strong&gt; prevents queue bottlenecks. Your framework must allow new requests to join an active decode loop, rather than waiting in a strict first-in-first-out pipeline.&lt;/p&gt;

&lt;p&gt;For RAG systems, &lt;strong&gt;prefix caching&lt;/strong&gt; is critical. By caching the system prompt and retrieved documents, you can reduce prefill latency and cost by &lt;a href="https://blog.gopenai.com/sglang-vs-vllm-the-new-throughput-king-7daec596f7fa?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;up to 90% for returning users&lt;/u&gt;&lt;/a&gt;, dramatically improving both responsiveness and efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Scaling strategies that actually work for LLMs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Traditional autoscaling based on CPU or memory utilization fails for LLM workloads. Inference is primarily memory-bandwidth and I/O bound, so scaling decisions must reflect user-facing latency and GPU saturation, not generic resource metrics.&lt;/p&gt;

&lt;p&gt;For Kubernetes deployments, configure the Horizontal Pod Autoscaler (HPA) around three signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Queue depth (num_requests_waiting) – the most responsive signal:&lt;/strong&gt; Best practice today is to trigger scale-out when just 3–5 requests begin to queue, preventing latency spikes before users feel them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P90 TTFT (Time to First Token) – the SLA-level experience metric:&lt;/strong&gt; If the 90th percentile drifts beyond roughly 200–500ms, new replicas should spin up to preserve a fast, conversational feel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU KV cache utilization (gpu_cache_usage_perc) – shows physical saturation:&lt;/strong&gt; High cache usage combined with a growing queue means existing pods can’t accept more tokens without instability or crashes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, cold starts were once the biggest constraint, with model loads taking 30–120 seconds. However, modern streaming loaders have reset that baseline. Using &lt;a href="https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;NVIDIA’s Run:ai Model Streamer&lt;/u&gt;&lt;/a&gt;, a 7B model can stream from object storage into GPU memory in 5–15 seconds. When paired with frameworks like vLLM, total Time to Ready (container spin-up plus engine initialization) is now under 25 seconds, reducing the need for costly warm pools.&lt;/p&gt;

&lt;p&gt;In practice, Kubernetes users should avoid default CPU-based scaling rules. Instead, configure the HPA to scale using queue depth and TTFT, since those directly reflect user experience and system saturation.&lt;/p&gt;
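&lt;p&gt;On Kubernetes, that translates into an HPA v2 manifest driven by a custom metric. A sketch, assuming vLLM's &lt;code&gt;num_requests_waiting&lt;/code&gt; gauge is exposed to the HPA through the Prometheus Adapter; the deployment name and threshold are placeholders:&lt;/p&gt;

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server          # placeholder deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: num_requests_waiting   # surfaced via Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "4"            # scale out once ~4 requests queue per pod
```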

&lt;p&gt;Scaling at the infrastructure layer is only part of the solution. Above it, Portkey’s &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;AI Gateway&lt;/u&gt;&lt;/a&gt; adds production safeguards such as automatic retries, exponential backoff, circuit breakers, and provider failover. These mechanisms prevent temporary overloads or upstream failures from cascading into user-visible downtime.&lt;/p&gt;

&lt;p&gt;With the Gateway, &lt;a href="https://portkey.ai/case-studies/leading-delivery-platform?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;a leading online food delivery platform successfully handled a 3,100× traffic increase&lt;/u&gt;&lt;/a&gt;, peaking at 1,800 requests per second, while maintaining 99.99% uptime by combining HPA-based scaling with gateway-level resilience controls.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Monitoring LLMs in production&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LLMs often fail silently, returning a successful status code (200 OK) while producing hallucinated or misleading outputs. Monitoring and observability are essential here: monitoring confirms the system is running, while observability confirms the response is actually correct. For this, production teams need a dual-layer telemetry approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Operational metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Operational metrics (the heartbeat of your inference stack) ensure the fleet is responsive and stable under load. The most important are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;P95 TTFT (Time to First Token) –  the gold standard for user experience:&lt;/strong&gt; In 2026, enterprise targets have tightened to 200–500ms to preserve a natural, conversational feel. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TPOT (Time Per Output Token):&lt;/strong&gt; This measures streaming smoothness, with a goal of under 50ms so responses outpace human reading speed. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV cache saturation – an early warning system:&lt;/strong&gt; Rising cache usage paired with growing queues signals an impending generation stall or crash long before errors appear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Semantic observability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Semantic observability evaluates whether answers are actually correct and safe. Using LLM-as-a-Judge techniques, teams can detect quality drift that traditional APM tools miss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faithfulness and grounding&lt;/strong&gt; checks verify that claims in a response are actually supported by the retrieved documents, forming the core guardrail for RAG systems. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://portkey.ai/blog/the-complete-guide-to-llm-observability/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Cost harvesting&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; detection tracks abusive prompts designed to explode token usage and drain budgets – a growing 2026 threat. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tone and bias drift&lt;/strong&gt; monitoring checks whether the model’s personality or alignment shifts subtly across millions of requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern platforms now unify these layers. For instance, Portkey provides observability directly within its AI Gateway, capturing over 40 data points per request across cost, performance, and accuracy – earning recognition as a &lt;a href="https://www.gartner.com/en/documents/7024598?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;Gartner Cool Vendor in LLM Observability (2025)&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;👀 Typical production SLOs in 2026 include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;99.99% availability.&lt;/li&gt;
&lt;li&gt;P95 TTFT of 200–500ms.&lt;/li&gt;
&lt;li&gt;P50 end-to-end RAG latency under 1.5 seconds.&lt;/li&gt;
&lt;li&gt;Hallucination rates below 1%, depending on domain risk tolerance.&lt;/li&gt;
&lt;/ul&gt;
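Expressed as data, these targets can gate deployments or drive alerts. A toy SLO check, with illustrative thresholds and measured values:

```python
# Illustrative SLO targets: (threshold, direction of "good").
slos = {
    "availability_pct":       (99.99, ">="),
    "p95_ttft_ms":            (500,   "<="),
    "p50_rag_latency_s":      (1.5,   "<="),
    "hallucination_rate_pct": (1.0,   "<="),
}

# Hypothetical measurements for one service over the last window.
measured = {
    "availability_pct": 99.995,
    "p95_ttft_ms": 430,
    "p50_rag_latency_s": 1.2,
    "hallucination_rate_pct": 0.6,
}

def meets(value, threshold, direction):
    return value >= threshold if direction == ">=" else value <= threshold

# Any SLO the measurements fail to meet is a breach worth alerting on.
breaches = [name for name, (t, d) in slos.items()
            if not meets(measured[name], t, d)]
```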

&lt;h2&gt;
  
  
  &lt;strong&gt;Controlling costs as you scale&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LLM costs grow in a fundamentally different way than traditional infrastructure. Instead of paying mainly for reserved compute, spending increases with usage volume, since every request consumes tokens. What looks like a few cents per call can quickly become a major expense at millions of requests per month.&lt;/p&gt;

&lt;p&gt;To manage this, teams must start with cost attribution – tracking spend by feature, team, and user. Without this visibility, optimization happens too late, after budgets have already been exceeded.&lt;/p&gt;
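Cost attribution only requires that each request carry metadata tags and token counts. A toy roll-up by team (the pricing constants are illustrative, not any provider's real rates):

```python
# Hypothetical request log: metadata tags plus token counts per call.
requests = [
    {"team": "support", "feature": "chatbot",    "in_tok": 1200, "out_tok": 300},
    {"team": "support", "feature": "summarizer", "in_tok": 4000, "out_tok": 500},
    {"team": "growth",  "feature": "chatbot",    "in_tok": 800,  "out_tok": 200},
]

# Illustrative pricing per 1M tokens (not any provider's real rates).
PRICE_IN, PRICE_OUT = 2.50, 10.00

def cost_usd(r):
    return (r["in_tok"] * PRICE_IN + r["out_tok"] * PRICE_OUT) / 1_000_000

# Aggregate spend by team; the same loop works for feature or user.
spend_by_team = {}
for r in requests:
    spend_by_team[r["team"]] = spend_by_team.get(r["team"], 0.0) + cost_usd(r)
```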

&lt;p&gt;Once costs are visible, three levers drive meaningful savings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching&lt;/strong&gt; avoids recomputing responses for repeated or similar queries. For instance, &lt;a href="https://portkey.ai/blog/reducing-llm-costs-and-latency-semantic-cache/" rel="noopener noreferrer"&gt;&lt;u&gt;semantic cache&lt;/u&gt;&lt;/a&gt; delivers around 20% average cache hit rates in early production and up to 60% in focused RAG systems, with roughly 20× faster responses at near-zero cost on cached requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent routing&lt;/strong&gt; sends simpler requests to cheaper, faster models while reserving premium models for complex reasoning. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch processing&lt;/strong&gt;, such as &lt;a href="https://developers.openai.com/api/docs/guides/batch?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;OpenAI’s Batch API&lt;/u&gt;&lt;/a&gt;, reduces costs for large, non-urgent workloads by grouping requests into lower-priced bulk jobs with flexible completion windows.&lt;/li&gt;
&lt;/ul&gt;
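The caching lever can be sketched in a few lines: embed each query, and serve a stored response when a new query is close enough. The bag-of-words "embedding" below is a stand-in for a real embedding model:

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in embedding: word counts. A production cache would call a
    # real embedding model here instead.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # (embedding, cached_response) pairs

    def get(self, query):
        q = embed(query)
        scored = [(cosine(q, e), resp) for e, resp in self.entries]
        if scored:
            score, response = max(scored, key=lambda x: x[0])
            if score >= self.threshold:
                return response
        return None  # cache miss: fall through to the LLM

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("What is the refund policy", "Refunds are issued within 14 days.")
```

The threshold is the key tuning knob: too low and users get stale or wrong answers, too high and the hit rate collapses.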

&lt;p&gt;If you’re looking for a way to maintain control over costs, opt for Portkey’s AI Gateway. One delivery platform &lt;a href="https://portkey.ai/case-studies/leading-delivery-platform?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;saved $500,000 through optimized routing and caching via Portkey&lt;/u&gt;&lt;/a&gt; while maintaining 99.99% uptime across billions of requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Building your deployment pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As you’ve seen, LLM deployment is an operational pipeline made up of infrastructure, serving, scaling, monitoring, and cost control, with each layer compounding the reliability and efficiency of the next.&lt;/p&gt;

&lt;p&gt;If your models are already running but the operational layer is holding you back, this is where an AI gateway like Portkey fits. &lt;/p&gt;

&lt;p&gt;Portkey sits above your inference stack, providing intelligent routing across providers, built-in observability, semantic caching, cost controls, and automatic failover, so you can scale reliably without building custom infrastructure.&lt;/p&gt;

&lt;p&gt;Ready to move from “model deployed” to truly production-ready? &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;Try Portkey’s AI gateway&lt;/u&gt;&lt;/a&gt; now and turn your LLMs into reliable, cost-efficient production systems!&lt;/p&gt;



</description>
      <category>ai</category>
      <category>architecture</category>
      <category>devops</category>
      <category>llm</category>
    </item>
    <item>
      <title>The best approach to compare LLM outputs</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Tue, 24 Feb 2026 13:45:39 +0000</pubDate>
      <link>https://forem.com/portkey/the-best-approach-to-compare-llm-outputs-4am4</link>
      <guid>https://forem.com/portkey/the-best-approach-to-compare-llm-outputs-4am4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3arsssydu4vizk4101oh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3arsssydu4vizk4101oh.png" alt="The best approach to compare LLM outputs" width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once LLMs are in production, output quality stops being a subjective question and becomes an operational one. Teams are no longer asking &lt;em&gt;whether&lt;/em&gt; they need to evaluate outputs, but &lt;em&gt;how&lt;/em&gt; to do it reliably as systems evolve.&lt;/p&gt;

&lt;p&gt;Production systems can change frequently. Prompts are iterated on, models are swapped, routing logic is adjusted, and traffic patterns shift. In this environment, point-in-time judgments of quality are not useful. What matters is whether output quality is stable, improving, or degrading across these changes.&lt;/p&gt;

&lt;p&gt;A repeatable approach to measurable output quality gives teams a way to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Establish baselines that survive prompt and model updates&lt;/li&gt;
&lt;li&gt;Compare outputs across versions, providers, and configurations&lt;/li&gt;
&lt;li&gt;Detect regressions before they reach users&lt;/li&gt;
&lt;li&gt;Reason about trade-offs between quality, latency, and cost&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Is manual review and one-off prompting enough to compare LLM outputs?
&lt;/h2&gt;

&lt;p&gt;Manual review and ad-hoc prompting still play a role in mature LLM systems, but they stop being sufficient once scale and change are introduced.&lt;/p&gt;

&lt;p&gt;Spot-checking outputs is inherently narrow. Reviewers see a small slice of traffic, often curated or synthetic, and their judgments are shaped by context that may not exist in real usage. As prompts and models evolve, those reviews quickly become outdated. What was “approved” last week may no longer reflect what users are seeing today.&lt;/p&gt;

&lt;p&gt;One-off prompting has similar limitations. Testing a handful of inputs against a new model or prompt can reveal obvious failures, but it does not capture variance. Non-determinism means two runs of the same prompt can produce meaningfully different outputs. Without repetition and aggregation, it’s impossible to tell whether a change actually improved quality or simply produced a better-looking example.&lt;/p&gt;

&lt;p&gt;At scale, manual review also introduces inconsistency. Different reviewers apply different standards, and those standards shift over time. There is no durable baseline, no reliable way to compare runs across days or deployments, and no systematic way to surface regressions early.&lt;/p&gt;

&lt;p&gt;Manual review remains useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calibrating evaluation criteria&lt;/li&gt;
&lt;li&gt;Investigating edge cases&lt;/li&gt;
&lt;li&gt;Providing high-quality labels for difficult tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But as the primary mechanism for comparing LLM outputs, it breaks down. Production systems need methods that are repeatable, aggregatable, and resilient to change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics to consider when comparing LLM outputs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The core dimensions:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Effective LLM evaluation requires metrics that map to the actual risks and goals of your application. Most production systems care about three primary dimensions:   &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;whether the output is factually grounded (hallucination detection),&lt;/li&gt;
&lt;li&gt;whether it addresses what the user actually asked (relevance and correctness), and&lt;/li&gt;
&lt;li&gt;whether it meets safety and quality standards (toxicity, coherence, formatting).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These dimensions are not interchangeable. A response can be highly relevant but factually wrong, or factually correct but unhelpful. Choosing which metrics to track depends on what failure modes matter most for your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deterministic vs. LLM-based metrics:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Metrics fall into two categories: &lt;em&gt;deterministic and model-based&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Deterministic metrics like regex matching, JSON validation, and keyword presence are fast, cheap, and predictable. They work well for structured outputs or hard constraints.&lt;/p&gt;

&lt;p&gt;Model-based metrics use an LLM to judge another LLM's output, which is necessary for evaluating qualities like coherence, completeness, or whether an answer is actually helpful. LLM-as-a-judge approaches are more expensive and introduce their own variance, but they scale where human review cannot. In practice, most teams use both: deterministic checks for objective criteria, model-based evaluation for subjective quality.&lt;/p&gt;
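The deterministic side is ordinary code. A few example checks (the criteria here are made up for illustration):

```python
import json
import re

def deterministic_checks(output: str) -> dict:
    """Fast, predictable checks that need no judge model."""
    results = {}
    # Hard constraint: the output must parse as valid JSON.
    try:
        json.loads(output)
        results["valid_json"] = True
    except json.JSONDecodeError:
        results["valid_json"] = False
    # Keyword presence: did the answer include the required field?
    results["mentions_total"] = "total" in output.lower()
    # Regex guard: no email addresses leaked into the output.
    results["no_email_leak"] = re.search(r"[\w.]+@[\w.]+", output) is None
    return results
```

Because these checks are cheap, they can run on every request, while judge models are reserved for sampled traffic.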

&lt;p&gt;&lt;a href="https://arize.com/llm-as-a-judge/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi5j9i7qtquyt434ioyy.png" alt="The best approach to compare LLM outputs" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choosing the right granularity:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Metrics also differ in scope.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Span-level evaluation scores individual LLM calls, useful for isolating which step in a pipeline is underperforming. &lt;/li&gt;
&lt;li&gt;Trace-level evaluation looks at the full chain of reasoning, which matters when correctness depends on multiple steps working together. &lt;/li&gt;
&lt;li&gt;Session-level evaluation captures user experience across a conversation, surfacing issues like frustration or confusion that only emerge over time. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Comparing outputs effectively means instrumenting at the right level and aggregating results in ways that surface actionable patterns, not just individual failures.&lt;/p&gt;
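Once each evaluation record carries span, trace, and session ids, the three scopes compose by aggregation. A toy roll-up over fabricated records:

```python
from collections import defaultdict

# Fabricated span-level evaluation records.
records = [
    {"session": "s1", "trace": "t1", "span": "retrieve", "passed": True},
    {"session": "s1", "trace": "t1", "span": "generate", "passed": False},
    {"session": "s1", "trace": "t2", "span": "generate", "passed": True},
]

# Trace-level: a trace passes only if every span in it passed.
by_trace = defaultdict(list)
for r in records:
    by_trace[r["trace"]].append(r["passed"])
trace_pass = {t: all(v) for t, v in by_trace.items()}

# Session-level: fraction of distinct traces in the session that passed.
session_traces = defaultdict(set)
for r in records:
    session_traces[r["session"]].add(r["trace"])
session_rate = {s: sum(trace_pass[t] for t in ts) / len(ts)
                for s, ts in session_traces.items()}
```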

&lt;h2&gt;
  
  
  How Arize approaches evals
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The evaluation loop:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Arize treats evaluation as an operational loop, not a one-time gate. Evaluations attach directly to traces, so every LLM call can be scored and explained in context. This means teams can run evaluations offline during development, testing prompt changes against datasets before deployment, and then transition the same evaluators to run online against production traffic. The result is a continuous feedback mechanism: evaluations surface regressions, traces provide the context to diagnose them, and experiments validate fixes before they ship.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-built and custom evaluators:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The platform includes pre-built &lt;a href="https://portkey.ai/docs/guides/prompts/llm-as-a-judge?ref=portkey.ai#building-an-llm-as-a-judge-system-for-ai-customer-support-agent" rel="noopener noreferrer"&gt;LLM-as-a-judge&lt;/a&gt; evaluators for common concerns: hallucination, relevance, toxicity, summarization quality, code correctness, and user frustration, among others. These templates are benchmarked against golden datasets with known precision and recall, so teams can deploy them with confidence. For domain-specific needs, custom evaluators can be defined in plain language, describing what "good" looks like in the same way you would brief a human reviewer, or implemented as deterministic code for objective criteria. Evaluators are versioned and reusable across projects, which keeps evaluation criteria consistent as systems evolve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explainability and action:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Every evaluation produces not just a label or score, but an explanation. This matters because knowing that an output was marked "hallucinated" is less useful than understanding why: which claim was unsupported, which context was missing. Explanations make evaluations actionable: they point teams toward specific retrieval failures, prompt gaps, or model limitations. Combined with trace-level observability, this turns evaluation from a reporting mechanism into a diagnostic tool that accelerates iteration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98ck2eve4ts41xndbj0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98ck2eve4ts41xndbj0t.png" alt="The best approach to compare LLM outputs" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Completing the loop with Portkey: routing and orchestration for evaluations
&lt;/h2&gt;

&lt;p&gt;Running meaningful &lt;a href="https://portkey.ai/docs/guides/use-cases/run-batch-evals?ref=portkey.ai#how-to-run-structured-output-evals-at-scale" rel="noopener noreferrer"&gt;LLM evaluations&lt;/a&gt; requires more than scoring outputs in isolation. Teams need a reliable way to orchestrate how different models, prompts, and configurations are exercised so that comparisons are intentional and repeatable.&lt;/p&gt;

&lt;p&gt;By acting as a routing layer for LLM traffic, Portkey's &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;AI Gateway&lt;/a&gt; makes this orchestration straightforward. Multiple models and providers can be invoked through a single, consistent API, allowing teams to evaluate alternatives side by side without changing application code. The same inputs can be routed to different models, prompt versions, or configurations as part of structured evaluation runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mkker32fip2ch1qyl2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mkker32fip2ch1qyl2y.png" alt="The best approach to compare LLM outputs" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This becomes especially useful when evaluations move closer to production. Instead of maintaining separate pipelines for testing and live traffic, teams can reuse the same routing logic to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compare models under identical conditions&lt;/li&gt;
&lt;li&gt;Test prompt changes against real inputs&lt;/li&gt;
&lt;li&gt;Run controlled experiments alongside production traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Observability ensures that these evaluations remain interpretable. Every request captures the context needed to explain outcomes: which model was used, how it was routed, which prompt version was applied, and how the system performed in terms of latency and cost.&lt;/p&gt;

&lt;p&gt;Together, routing and observability turn evaluations into an operational loop. Results from Arize evaluations can be traced back to the exact model and routing decisions that produced them, making it easier to act on those insights and iterate with confidence.&lt;/p&gt;

&lt;p&gt;To try it out, &lt;a href="https://app.portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;get started with Portkey&lt;/a&gt; for free (Portkey is open source!).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://portkey.ai/book-a-demo?utm_medium=blog&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;Book a demo&lt;/a&gt; with our experts if you're looking to deploy AI Agents in production!&lt;/p&gt;

</description>
      <category>aievals</category>
      <category>observability</category>
    </item>
    <item>
      <title>OpenAI Responses API vs. Chat Completions vs. Anthropic Messages API</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Mon, 23 Feb 2026 14:27:32 +0000</pubDate>
      <link>https://forem.com/portkey/open-ai-responses-api-vs-chat-completions-vs-anthropic-messages-api-33h2</link>
      <guid>https://forem.com/portkey/open-ai-responses-api-vs-chat-completions-vs-anthropic-messages-api-33h2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuoszatlu11gfgohkqp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuoszatlu11gfgohkqp3.png" alt="Open AI Responses API vs. Chat Completions vs. Anthropic Messages API"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The LLM API landscape has never been more fragmented, or more consequential. As teams move from prototypes to production, the choice of which API format to build on shapes your vendor flexibility, your codebase complexity, and how quickly you can swap models when something better comes along.&lt;/p&gt;

&lt;p&gt;Today, three API formats dominate how AI Agents talk to LLMs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI's Chat Completions API&lt;/strong&gt; — the de facto standard, universally supported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI's Responses API&lt;/strong&gt; — the newer, agent-oriented evolution with built-in tools and state management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic's Messages API&lt;/strong&gt; — Claude's native interface, with capabilities like extended thinking and prompt caching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each was designed with different goals in mind, and the differences affect how you build, how you scale, and how locked in you are to a single provider.&lt;/p&gt;

&lt;p&gt;Portkey supports all three natively, and that's where standardization starts to matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI Responses API vs. Chat Completions vs. Messages API: At a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Chat Completions&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Responses API&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Messages API&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Provider&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Endpoint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /v1/chat/completions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /v1/responses&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /v1/messages&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design goal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stateless text generation&lt;/td&gt;
&lt;td&gt;Agentic workflows with built-in tools&lt;/td&gt;
&lt;td&gt;Claude-native capabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Optional server-side (with &lt;code&gt;store: true&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool / function calling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ (with built-in tools)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Built-in web search&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ (via server tools)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extended thinking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ (Claude only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prompt caching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ (with &lt;code&gt;cache_control&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Computer use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ecosystem compatibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Widest&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;td&gt;Claude-specific&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What makes each endpoint different
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Chat Completions: the universal standard
&lt;/h3&gt;

&lt;p&gt;Chat Completions (&lt;code&gt;POST /v1/chat/completions&lt;/code&gt;) is where everything started. You send an array of messages, each with a role (&lt;code&gt;system&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;assistant&lt;/code&gt;, or &lt;code&gt;tool&lt;/code&gt; for tool call results) and the model replies. It's stateless by design, so you own the conversation history and pass it with every request.&lt;/p&gt;

&lt;p&gt;This simplicity is its biggest strength. Because the model has no memory between calls, you have full control over what context it sees. And because practically every major provider has adopted this format, code written against Chat Completions works across OpenAI, Anthropic (via adapters), Gemini, Mistral, Bedrock, and other models with minimal changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Widest ecosystem of tools, frameworks, and libraries&lt;/li&gt;
&lt;li&gt;Predictable, well-understood response format&lt;/li&gt;
&lt;li&gt;Easiest path to switching providers or running multi-provider setups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What it doesn't do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No built-in tools. Web search, code execution, file search all need external orchestration&lt;/li&gt;
&lt;li&gt;No native support for extended reasoning or prompt caching&lt;/li&gt;
&lt;li&gt;No server-side state, you manage conversation history entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The response object returns &lt;code&gt;choices&lt;/code&gt;, where each choice contains a &lt;code&gt;message&lt;/code&gt; with &lt;code&gt;role: "assistant"&lt;/code&gt; and &lt;code&gt;content&lt;/code&gt;. Tool calls come back in &lt;code&gt;tool_calls&lt;/code&gt;. Clean and predictable, which is why it became the lingua franca of LLM APIs.&lt;/p&gt;
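Tool calls in that shape are handled by inspecting the message. The payload below is a hand-written, abridged illustration of the documented schema:

```python
import json

# Abridged, hand-written illustration of a Chat Completions response.
response = {
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": None,  # empty when the model calls a tool instead
            "tool_calls": [{
                "id": "call_abc",
                "type": "function",
                "function": {"name": "get_weather",
                             "arguments": '{"city": "Paris"}'},
            }],
        },
        "finish_reason": "tool_calls",
    }],
}

msg = response["choices"][0]["message"]
for call in msg.get("tool_calls") or []:
    fn = call["function"]
    args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
    # Run your own tool here, then append a {"role": "tool", ...} message
    # with the result and call the API again.
```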

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When your use case is primarily text generation: chatbots, summarization, classification, content generation, Q&amp;amp;A. It's the right default if you're using frameworks like LangChain or LlamaIndex that abstract over providers, or if cross-provider portability matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Responses API: built for agents
&lt;/h3&gt;

&lt;p&gt;The Responses API (&lt;code&gt;POST /v1/responses&lt;/code&gt;) takes a different approach. It's designed to run agentic loops: the model can call multiple built-in tools (web search, file search, code interpreter, computer use, remote MCP servers) within a single API request, without you orchestrating each step.&lt;/p&gt;

&lt;p&gt;State management comes in two forms. With &lt;code&gt;previous_response_id&lt;/code&gt;, you chain responses by referencing a prior response ID and the model picks up context without you resending the full history, but you're still tracking the ID yourself. The newer Conversations API goes further, maintaining a durable conversation object server-side that automatically accumulates turns across sessions.&lt;/p&gt;
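Chaining via previous_response_id looks roughly like this, with request bodies shown as plain dicts (the model name and response id are assumptions for illustration):

```python
# Turn 1: a standalone request; store=True lets the server retain it.
first_request = {
    "model": "gpt-4o",   # assumption: any Responses-capable model works
    "input": "Summarize our Q3 sales numbers.",
    "store": True,
}

# Suppose the API returned an id for that stored response:
first_response_id = "resp_123"  # hypothetical id

# Turn 2: reference the prior turn instead of resending the history.
followup_request = {
    "model": "gpt-4o",
    "input": "Now break that down by region.",
    "previous_response_id": first_response_id,
}
```

Note the follow-up carries no message history at all; the server reconstructs context from the referenced response.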

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built-in tools that run within a single request, no external orchestration needed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;previous_response_id&lt;/code&gt; chains turns without resending prior tokens&lt;/li&gt;
&lt;li&gt;Designed for multi-step agentic workflows where context and tool results accumulate&lt;/li&gt;
&lt;li&gt;Better cache utilization compared to Chat Completions for repeated context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What it doesn't do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Natively available on OpenAI models only (though Portkey makes it work across providers)&lt;/li&gt;
&lt;li&gt;More complex response structure, &lt;code&gt;output&lt;/code&gt; is an array of typed items rather than a single message&lt;/li&gt;
&lt;li&gt;Overkill for simple single-turn completions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When you're building autonomous agents that use built-in tools, or multi-turn workflows where you want to reduce token overhead across turns. It's the right choice when agentic behavior and tool use are core to your application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Messages API: Claude's native interface
&lt;/h3&gt;

&lt;p&gt;Anthropic's Messages API (&lt;code&gt;POST /v1/messages&lt;/code&gt;) is designed around how Claude works. While it shares surface similarities with Chat Completions, it exposes capabilities that are specific to Claude and don't exist in OpenAI's formats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extended thinking:&lt;/strong&gt; Claude returns &lt;code&gt;type: "thinking"&lt;/code&gt; content blocks before the final answer, exposing its reasoning process. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt caching:&lt;/strong&gt; Fine-grained &lt;code&gt;cache_control&lt;/code&gt; lets you cache specific content blocks (with 5-minute or 1-hour TTLs), reducing latency and cost significantly for repeated context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich content blocks:&lt;/strong&gt; The &lt;code&gt;content&lt;/code&gt; array supports text, images, PDFs, tool use, thinking blocks, and citations pointing to source documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop reason granularity:&lt;/strong&gt; &lt;code&gt;stop_reason&lt;/code&gt; can be &lt;code&gt;end_turn&lt;/code&gt;, &lt;code&gt;max_tokens&lt;/code&gt;, &lt;code&gt;stop_sequence&lt;/code&gt;, &lt;code&gt;tool_use&lt;/code&gt;, &lt;code&gt;pause_turn&lt;/code&gt;, or &lt;code&gt;refusal&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native web search:&lt;/strong&gt; Pass &lt;code&gt;{"type": "web_search_20250305", "name": "web_search"}&lt;/code&gt; in the &lt;code&gt;tools&lt;/code&gt; array and Claude handles execution server-side, returning results in a &lt;code&gt;server_tool_use&lt;/code&gt; response block&lt;/li&gt;
&lt;/ul&gt;
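A prompt-caching request body, sketched as a plain dict (the model name and placeholder text are assumptions):

```python
# Sketch of a Messages API request body with prompt caching applied to
# a large system prompt.
request = {
    "model": "claude-sonnet-4",   # assumption: any current Claude model
    "max_tokens": 1024,
    "system": [{
        "type": "text",
        "text": "...several thousand tokens of policy text...",
        "cache_control": {"type": "ephemeral"},  # cache this block
    }],
    "messages": [
        {"role": "user", "content": "Does the policy cover refunds?"},
    ],
}
```

Subsequent requests that repeat the cached block verbatim read it from cache instead of reprocessing it.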

&lt;p&gt;&lt;strong&gt;What it doesn't do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No server-side state management (you manage history yourself)&lt;/li&gt;
&lt;li&gt;Not natively compatible with non-Anthropic providers without a translation layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The response object returns a &lt;code&gt;content&lt;/code&gt; array of typed blocks. A single response might include a &lt;code&gt;thinking&lt;/code&gt; block, a &lt;code&gt;text&lt;/code&gt; block, and a &lt;code&gt;tool_use&lt;/code&gt; block in sequence. Citations on text blocks tell you exactly which document or character range the model drew from.&lt;/p&gt;
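Consuming that typed array is a matter of filtering on each block's type field; the blocks below are fabricated for illustration:

```python
# Fabricated Messages API response content array (abridged).
content = [
    {"type": "thinking", "thinking": "The user wants totals per region..."},
    {"type": "text", "text": "Here is the regional breakdown."},
    {"type": "tool_use", "id": "toolu_1", "name": "query_db",
     "input": {"table": "sales"}},
]

# Collect the user-facing answer and any tool invocations separately.
final_text = "".join(b["text"] for b in content if b["type"] == "text")
tool_calls = [b for b in content if b["type"] == "tool_use"]
```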

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When you're building specifically on Claude and need extended thinking for complex reasoning, prompt caching for document-heavy workloads, or reasoning transparency in your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Portkey supports all three
&lt;/h2&gt;

&lt;p&gt;Most teams end up needing more than one of these formats. You might use Chat Completions for a general assistant, the Responses API for an autonomous agent, and Claude's Messages API for a document reasoning pipeline, all in the same &lt;a href="https://portkey.ai/features/agents?ref=portkey.ai" rel="noopener noreferrer"&gt;AI Agent&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Building direct integrations with each provider means separate SDKs, separate observability, and code that breaks every time you want to try a new model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Portkey sits between your application and every provider, handling the translation so you don't have to.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best part: you can &lt;strong&gt;use any of the three API formats with any provider and model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Want to use the Messages API format but route to a Gemini model? Portkey handles the transformation. Want Chat Completions format but call a Claude model? Same thing.&lt;/p&gt;

&lt;p&gt;Beyond format flexibility, Portkey adds what direct API access can't give you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Every request logged, traced, and searchable across all providers and API formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fallbacks and load balancing:&lt;/strong&gt; Route to a backup provider if your primary is down or rate-limited&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt management:&lt;/strong&gt; Version, test, and deploy prompts centrally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost tracking:&lt;/strong&gt; Unified spend view across providers and models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance:&lt;/strong&gt; Enterprise controls over which teams access which models&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Making calls with each API through Portkey
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Chat Completions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;portkey_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Portkey&lt;/span&gt;

&lt;span class="n"&gt;portkey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Portkey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PORTKEY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;portkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@openai-provider/gpt-5.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum computing in simple terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Responses API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;portkey_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Portkey&lt;/span&gt;

&lt;span class="n"&gt;portkey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Portkey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PORTKEY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;portkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@openai-provider/gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum computing in simple terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Messages API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PORTKEY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.portkey.ai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@anthropic-provider/claude-sonnet-4-5-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum computing in simple terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Switching providers without changing your code
&lt;/h3&gt;

&lt;p&gt;The real payoff is when you want to swap providers.&lt;/p&gt;

&lt;p&gt;To move from OpenAI to Claude on the Responses API, change only the provider name and the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;portkey_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Portkey&lt;/span&gt;

&lt;span class="n"&gt;portkey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Portkey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PORTKEY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;portkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@anthropic-provider/claude-sonnet-4-5-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum computing in simple terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your application code, your observability, your fallback logic, none of it changes. The format stays the same and Portkey's &lt;a href="https://portkey.ai/docs/product/ai-gateway/universal-api?ref=portkey.ai" rel="noopener noreferrer"&gt;universal API&lt;/a&gt; handles the translation to whichever provider you're routing to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;All three API formats work through Portkey with a single configuration change, pointing your SDK's base URL at Portkey's gateway. From there, routing, translation, observability, and reliability are handled for you.&lt;/p&gt;
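&lt;p&gt;As a minimal illustration of that single change, the sketch below builds an OpenAI-style request against Portkey's gateway using only the standard library. The base URL and headers follow Portkey's docs, but treat them as assumptions and verify against your own deployment (self-hosted gateways use their own host).&lt;/p&gt;

```python
import json
import urllib.request

# One configuration change: point the base URL at Portkey's gateway.
# (URL and header names are assumptions based on Portkey's docs.)
BASE_URL = "https://api.portkey.ai/v1"

payload = {
    "model": "@openai-provider/gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer PORTKEY_API_KEY",  # Portkey key, not a provider key
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would send the request; omitted here.
print(req.full_url)
```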

&lt;p&gt;&lt;a href="https://portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;strong&gt;Get started with Portkey&lt;/strong&gt;&lt;/a&gt; | &lt;a href="https://docs.portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;strong&gt;Read the docs&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To explore how Portkey can support your AI strategy, &lt;a href="https://calendly.com/portkey-ai/quick-meeting?ref=portkey.ai" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>llm</category>
      <category>openai</category>
    </item>
    <item>
      <title>Introducing the MCP Gateway!</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Wed, 21 Jan 2026 15:14:37 +0000</pubDate>
      <link>https://forem.com/portkey/introducing-the-mcp-gateway-2j4m</link>
      <guid>https://forem.com/portkey/introducing-the-mcp-gateway-2j4m</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscakfgxlz97g8sonzjt2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscakfgxlz97g8sonzjt2.png" alt="Introducing the MCP Gateway!" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your interns need three approvals to touch production. Your AI agents? Zero.&lt;/p&gt;

&lt;p&gt;With MCP, agents can take &lt;em&gt;real&lt;/em&gt; action – ✅ connect to databases, ✅ trigger workflows, ✅ access internal systems. The protocol just works.&lt;/p&gt;

&lt;p&gt;What's missing is the operational layer that lets &lt;strong&gt;platform teams say yes&lt;/strong&gt; without losing control.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://www.reddit.com/r/automation/comments/1q6d1so/why_everyones_obsessed_with_mcp_right_now/" rel="noopener noreferrer"&gt;Why Everyone’s Obsessed With MCP Right Now&lt;/a&gt;&lt;br&gt;&lt;br&gt;
by&lt;a href="https://www.reddit.com/user/According-Site9848/" rel="noopener noreferrer"&gt;u/According-Site9848&lt;/a&gt; in&lt;a href="https://www.reddit.com/r/automation/" rel="noopener noreferrer"&gt;automation&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;&lt;em&gt;Over the last year, that early experimentation has turned into meaningful adoption.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;MCP servers are now a core part of how modern agents interact with tools and data sources. The number of publicly available, deployable MCP servers has grown, and so has their adoption.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2crc7eyo2k3275eh0fy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2crc7eyo2k3275eh0fy1.png" alt="Introducing the MCP Gateway!" width="800" height="585"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More importantly, this adoption is happening in enterprise-aligned scenarios where teams are connecting dozens or hundreds of MCP servers in real deployments.&lt;/p&gt;

&lt;p&gt;The pattern is clear across early adopters: internal MCP servers exposing company systems, third-party vendor servers, and open-source tools, all flowing through the same agent workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why operating MCP at scale introduces new requirements
&lt;/h2&gt;

&lt;p&gt;A typical MCP setup today includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Third-party MCP servers maintained outside the organization&lt;/li&gt;
&lt;li&gt;Internal MCP servers owned by platform or infrastructure teams&lt;/li&gt;
&lt;li&gt;Agents consuming both as part of a single workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using multiple types of MCP servers changes the operational equation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication stops being uniform.&lt;/strong&gt; Third-party servers use OAuth. Internal servers integrate with Okta or Entra ID. Agents now need credentials for both, and the list keeps growing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Access control gets complex fast.&lt;/strong&gt; Some tools should be broadly available. Others need tight scoping by team, agent, or environment. Managing this at the agent or server level doesn't scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visibility fragments by default.&lt;/strong&gt; Usage data lives with each MCP server. Understanding which tools are actually being used, by which agents, and with what impact requires stitching together multiple sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational changes become risky.&lt;/strong&gt; Rotating credentials, updating access, or enforcing new policies means touching individual servers or redeploying agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How Portkey's MCP Gateway works&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The gateway sits between agents and MCP servers. It moves operational concerns out of individual servers and agent runtimes, and into a shared control plane managed by platform teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unified authentication across MCP servers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c0ioaubdiukrypqtbgs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c0ioaubdiukrypqtbgs.png" alt="Introducing the MCP Gateway!" width="800" height="146"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;In ad-hoc setups, each MCP server handles auth independently. Agents juggle credentials. Platform teams can't rotate keys without touching every deployment.&lt;/p&gt;

&lt;p&gt;The gateway centralizes authentication. Each server is authenticated once, and the gateway handles downstream auth with every MCP server: OAuth flows, API keys, custom headers, whatever that server requires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For internal MCP servers:&lt;/strong&gt; Connect without distributing credentials. Access is managed at the organization, workspace, or team level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For enterprise identity:&lt;/strong&gt; Integrate directly with Okta, Entra ID, or any OIDC/OAuth-compliant provider. Users authenticate with existing enterprise credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For third-party servers:&lt;/strong&gt; OAuth flows and API keys are managed at the gateway level, not scattered across agent configs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For existing APIs:&lt;/strong&gt; The gateway can auto-generate MCP tools from OpenAPI specs, bringing them into the same auth model without building custom servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Fine-grained authorization and access control&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Authentication gets agents in the door. Authorization determines what they can actually do.&lt;/p&gt;

&lt;p&gt;The gateway enforces fine-grained access control across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Organizations, workspaces, and teams&lt;/li&gt;
&lt;li&gt;Individual users or agents&lt;/li&gt;
&lt;li&gt;Specific MCP servers&lt;/li&gt;
&lt;li&gt;Individual tools within those servers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Internal systems stay tightly scoped. Common utilities are available broadly. Permissions are policy decisions managed at the gateway—not implementation details buried in agent code.&lt;/p&gt;

&lt;p&gt;Change access controls without redeploying agents or touching MCP servers.&lt;/p&gt;
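&lt;p&gt;A toy model of those checks: the sketch below evaluates a tool call against policies scoped by workspace, agent, server, and tool. Portkey's real policy model is configured at the gateway; the names and structure here are purely illustrative.&lt;/p&gt;

```python
# Hypothetical sketch of gateway-side access checks across the scopes listed
# above. The rule names and structure are illustrative, not Portkey's schema.
POLICIES = [
    # each rule scopes workspace, agent, server, and tool; "*" is a wildcard
    {"workspace": "platform", "agent": "*", "server": "github-mcp", "tool": "*"},
    {"workspace": "support", "agent": "triage-bot",
     "server": "crm-mcp", "tool": "read_ticket"},
]

def is_allowed(workspace, agent, server, tool):
    """Allow a tool call only if some policy matches every scope level."""
    def match(rule, value):
        return rule == "*" or rule == value
    return any(
        match(p["workspace"], workspace)
        and match(p["agent"], agent)
        and match(p["server"], server)
        and match(p["tool"], tool)
        for p in POLICIES
    )

print(is_allowed("support", "triage-bot", "crm-mcp", "read_ticket"))   # True
print(is_allowed("support", "triage-bot", "crm-mcp", "delete_ticket")) # False
```

&lt;p&gt;Because the rules live in one place, tightening a tool's scope is a data change at the gateway rather than a redeploy of every agent.&lt;/p&gt;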

&lt;h3&gt;
  
  
  &lt;strong&gt;Centralized MCP server and tool registry&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once authentication is centralized, access needs to be explicit.&lt;/p&gt;

&lt;p&gt;Portkey’s MCP Gateway introduces a centralized registry that makes MCP servers and tools explicit, discoverable, and manageable. Every MCP server is registered once and becomes part of a controlled inventory rather than a hidden dependency embedded in agent runtimes.&lt;/p&gt;

&lt;p&gt;The registry provides a clear view of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal, external, and third-party MCP servers&lt;/li&gt;
&lt;li&gt;The tools each server exposes&lt;/li&gt;
&lt;li&gt;Ownership and intended scope&lt;/li&gt;
&lt;li&gt;Usage across agents and teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Platform teams can see which tools are being used, by which agents, how frequently, and in what context. This makes it possible to identify critical dependencies, spot unused or risky tools, and understand how MCP capabilities are actually being consumed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgsh0bqhvmpjz25e4qin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgsh0bqhvmpjz25e4qin.png" alt="Introducing the MCP Gateway!" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;MCP server usage data&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By attaching usage data directly to MCP servers and tools, the registry supports informed decisions around access, policy, and lifecycle management.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Policy enforcement at the access layer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The gateway applies policies at the access layer, before requests reach LLMs or MCP servers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Centralized guardrails:&lt;/strong&gt; Enforce PII redaction, content filtering, and compliance policies once. They apply uniformly across every agent and workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block risky tool calls:&lt;/strong&gt; Define which tools can take which actions. Prevent unauthorized invocations before they execute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistent policies across LLMs and MCP:&lt;/strong&gt; Use a single control plane to govern both model interactions and tool usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;End-to-end observability for MCP usage&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Portkey’s MCP Gateway provides end-to-end observability across the full agent execution path.&lt;/p&gt;

&lt;p&gt;Every MCP interaction is captured in context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which agent made the request&lt;/li&gt;
&lt;li&gt;Which MCP server and tool were invoked&lt;/li&gt;
&lt;li&gt;What parameters were passed&lt;/li&gt;
&lt;li&gt;How the call behaved and whether it succeeded or failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This MCP-level visibility is correlated with LLM activity, giving platform teams a single, coherent view of how agents reason, act, and interact with tools. When something goes wrong, it is possible to trace failures back to the exact tool call or policy decision that caused them.&lt;/p&gt;

&lt;p&gt;Combined with the MCP registry and policy enforcement, this visibility allows teams to operate MCP with the same rigor as any other production system.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What this enables for platform teams&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;With the gateway in place, MCP becomes infrastructure you can operate deliberately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Safe sharing&lt;/strong&gt; of MCP servers and tools across teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear ownership&lt;/strong&gt; and lifecycle management for MCP capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized enforcement&lt;/strong&gt; of security, compliance, and usage policies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster iteration&lt;/strong&gt; without embedding control logic into agents or servers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly: MCP can move from isolated usage to organization-wide adoption without increasing operational risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Built on production infrastructure
&lt;/h3&gt;

&lt;p&gt;The gateway runs on the same infrastructure that processes &lt;strong&gt;640 billion tokens monthly&lt;/strong&gt; for Fortune 50 companies including DoorDash and Roche.&lt;/p&gt;

&lt;p&gt;It integrates directly with Portkey's AI Gateway. Manage models, MCP servers, tools, policies, and observability through one control plane. Use them together or independently.&lt;/p&gt;

&lt;p&gt;Also, the gateway is fully open source.&lt;/p&gt;

&lt;p&gt;Inspect the code. Run it in your environment. Extend it for your use case. No opaque infrastructure in critical paths. No vendor lock-in.&lt;/p&gt;

&lt;p&gt;This is infrastructure you can trust because you can verify it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed up your MCP adoption today
&lt;/h2&gt;

&lt;p&gt;MCP has changed how agents access tools. What’s changing now is how teams operate MCP once it becomes shared infrastructure.&lt;/p&gt;

&lt;p&gt;As adoption accelerates, the challenge is no longer protocol support. It’s access control, policy enforcement, visibility, and security. These are platform problems, and they require platform-level solutions.&lt;/p&gt;

&lt;p&gt;Portkey’s &lt;a href="https://portkey.ai/features/mcp" rel="noopener noreferrer"&gt;MCP Gateway&lt;/a&gt; is open source and ready to run.&lt;/p&gt;

&lt;p&gt;Explore the repository, review the architecture, and try it with your existing MCP servers and agents.&lt;/p&gt;

&lt;p&gt;If you’re operating MCP today or planning to roll it out across your organization, get started or book a demo with us!&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>mcpservers</category>
      <category>mcpregistry</category>
    </item>
    <item>
      <title>MCP tool discovery for autonomous LLM agents</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Fri, 26 Dec 2025 12:51:38 +0000</pubDate>
      <link>https://forem.com/portkey/mcp-tool-discovery-for-autonomous-llm-agents-3n3</link>
      <guid>https://forem.com/portkey/mcp-tool-discovery-for-autonomous-llm-agents-3n3</guid>
<description>&lt;p&gt;Large language model agents are increasingly expected to work with real systems, such as files, APIs, databases, and execution environments, through tools. As the number of available tools grows, however, most agent architectures struggle to scale in a clean, autonomous way.&lt;/p&gt;

&lt;p&gt;The paper &lt;strong&gt;“MCP-Zero: Active Tool Discovery for Autonomous LLM Agents”&lt;/strong&gt; introduces a different approach to tool use: instead of giving models a fixed set of tools up front, it lets agents &lt;em&gt;actively request&lt;/em&gt; the tools they need, when they need them. (&lt;a href="https://arxiv.org/html/2506.01056v3?ref=portkey.ai" rel="noopener noreferrer"&gt;Source&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;At a high level, MCP-Zero reframes tool usage as a capability discovery problem rather than a retrieval or prompt-injection problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why current tool selection approaches break down
&lt;/h2&gt;

&lt;p&gt;Today, most LLM agents use one of these two patterns for tool access:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System-prompt injection&lt;/strong&gt;, where all available tool schemas are included in the prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval-based selection&lt;/strong&gt;, where tools are selected once based on the user’s initial query&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But these have limitations at scale.&lt;/p&gt;

&lt;p&gt;Injecting full tool schemas creates extreme context overhead. Even a single MCP server can require thousands of tokens just to describe its tools, and large MCP ecosystems can exceed &lt;strong&gt;200k tokens&lt;/strong&gt; in total. This forces models into a passive role, scanning oversized prompts instead of reasoning about the task.&lt;/p&gt;

&lt;p&gt;Retrieval-based approaches reduce prompt size, but they still assume that tool needs are static. Many agent tasks are multi-step, and new requirements emerge as the agent reads files, inspects outputs, or encounters errors. Selecting tools once, based only on the initial user query, often results in missing or incorrect tools later in the task.&lt;/p&gt;

&lt;p&gt;The key issue is autonomy: in both cases, the agent does not control how it acquires new capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  How's MCP-Zero different?
&lt;/h2&gt;

&lt;p&gt;MCP-Zero starts from a simple shift in responsibility: &lt;strong&gt;the agent itself decides when it needs a tool&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of being handed a predefined list of tools, the model is prompted to explicitly declare missing capabilities as they arise during task execution. When the agent realizes it cannot proceed with its own knowledge, it issues a structured tool request that describes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;strong&gt;server domain&lt;/strong&gt; (platform or permission scope)&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;tool operation&lt;/strong&gt; it needs to perform&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This request is generated by the model, not inferred from the user query. That distinction matters. Model-generated requests tend to align much more closely with tool documentation than raw user input, which improves matching accuracy downstream.&lt;/p&gt;

&lt;p&gt;Importantly, the agent can make &lt;strong&gt;multiple tool requests over the course of a single task&lt;/strong&gt;, rather than being limited to a single selection step at the beginning.&lt;/p&gt;

&lt;p&gt;Once the agent issues a tool request, MCP-Zero uses a &lt;strong&gt;hierarchical semantic routing&lt;/strong&gt; process to locate relevant tools efficiently.&lt;/p&gt;

&lt;p&gt;This happens in two stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Server-level filtering&lt;/strong&gt;
The system first matches the requested server domain against MCP server descriptions (and enriched summaries) to narrow the search space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-level ranking&lt;/strong&gt;
Within the selected servers, individual tools are ranked based on semantic similarity between the requested operation and the tool descriptions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By separating server selection from tool selection, MCP-Zero avoids scanning thousands of tools at once while preserving precision.&lt;/p&gt;
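&lt;p&gt;The two stages can be sketched as follows. The paper scores requests with embedding similarity; a word-overlap score stands in for it here, and the server/tool catalog is invented for illustration.&lt;/p&gt;

```python
# Toy sketch of MCP-Zero's two-stage routing. Word overlap stands in for
# the paper's embedding similarity; the catalog below is invented.
SERVERS = {
    "filesystem": {
        "description": "read write list files and directories",
        "tools": {
            "read_file": "read the contents of a file",
            "write_file": "write text content to a file",
        },
    },
    "executor": {
        "description": "run execute python code in a sandbox",
        "tools": {
            "run_python": "execute a python script and capture output",
        },
    },
}

def similarity(a, b):
    """Stand-in for embedding similarity: Jaccard overlap of words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa.intersection(wb)) / max(len(wa.union(wb)), 1)

def route(server_request, tool_request, top_servers=1):
    # Stage 1: narrow the search to the best-matching servers by description.
    ranked = sorted(
        SERVERS,
        key=lambda s: similarity(server_request, SERVERS[s]["description"]),
        reverse=True,
    )
    # Stage 2: rank individual tools only within the selected servers.
    scored = [
        (s, name, similarity(tool_request, desc))
        for s in ranked[:top_servers]
        for name, desc in SERVERS[s]["tools"].items()
    ]
    best = max(scored, key=lambda x: x[2])
    return best[0], best[1]

print(route("files and directories", "read the contents of a file"))
```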

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8z8pzvz6vicvrqnm3d6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8z8pzvz6vicvrqnm3d6.png" width="781" height="1840"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;MCP-Zero’s iterative active invocation (Source)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A key difference between MCP-Zero and prior work is that &lt;strong&gt;tool discovery is iterative&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After receiving a tool, the agent evaluates whether it is sufficient for the current subtask. If it is not, the agent can refine its request and trigger another retrieval cycle. This creates a natural feedback loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;analyze the task&lt;/li&gt;
&lt;li&gt;request a tool&lt;/li&gt;
&lt;li&gt;use the tool&lt;/li&gt;
&lt;li&gt;reassess what’s missing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is especially important for multi-step workflows such as debugging code, where tools from different domains (filesystem, code editing, execution) are needed in sequence.&lt;/p&gt;
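&lt;p&gt;The feedback loop itself can be sketched in a few lines. The task list, tool table, and stopping condition below are invented; in MCP-Zero the model's own judgment drives each step.&lt;/p&gt;

```python
# Minimal sketch of the analyze / request / use / reassess loop.
def run_task(subtasks, resolve_tool, max_rounds=10):
    """Repeatedly request a tool for the next missing capability."""
    done = []
    for _ in range(max_rounds):
        missing = [s for s in subtasks if s not in done]   # analyze the task
        if not missing:
            break                                          # nothing left to do
        tool = resolve_tool(missing[0])                    # request + match a tool
        done.append(tool())                                # use the tool
        # loop continues: reassess what is still missing
    return done

# Invented capability table; each "tool" just reports what it did.
TOOLS = {
    "read file": lambda: "read file",
    "edit code": lambda: "edit code",
    "run tests": lambda: "run tests",
}

print(run_task(["read file", "edit code", "run tests"], TOOLS.get))
```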

&lt;h2&gt;
  
  
  What this means for MCP-based agent systems
&lt;/h2&gt;

&lt;p&gt;As the MCP ecosystem grows with hundreds of servers &amp;amp; thousands of tools, the cost of naïvely exposing everything to an agent becomes unsustainable. Prompt injection does not scale, and retrieval-only approaches still treat agents as passive consumers of tools.&lt;/p&gt;

&lt;p&gt;MCP-Zero shows a different path: &lt;strong&gt;treat tool discovery as an agent capability, not an external preprocessing step&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;MCP-Zero reframes tool usage as an active, agent-driven process rather than a static configuration problem. By allowing agents to request tools on demand, route them efficiently, and refine their capabilities over time, it addresses both the scalability and autonomy limits of existing approaches.&lt;/p&gt;

&lt;p&gt;For MCP-based agent systems, this points toward a more sustainable design pattern, where agents stay lightweight, adaptive, and in control of how they extend their own capabilities.&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://arxiv.org/abs/2506.01056?ref=portkey.ai" rel="noopener noreferrer"&gt;arXiv:2506.01056 [cs.AI]&lt;/a&gt;&lt;/p&gt;

</description>
      <category>modelcontextprotocol</category>
      <category>agents</category>
    </item>
    <item>
      <title>Enterprise MCP access control: managing tools, servers, and agents</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Tue, 23 Dec 2025 10:31:32 +0000</pubDate>
      <link>https://forem.com/portkey/enterprise-mcp-access-control-managing-tools-servers-and-agents-4fh</link>
      <guid>https://forem.com/portkey/enterprise-mcp-access-control-managing-tools-servers-and-agents-4fh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5noqqrocivnpj6f74yx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5noqqrocivnpj6f74yx.png" alt="Enterprise MCP access control: managing tools, servers, and agents" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MCP has moved fast.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvg0i9enprd1k8bh9lbcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvg0i9enprd1k8bh9lbcd.png" alt="Enterprise MCP access control: managing tools, servers, and agents" width="800" height="542"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What began as a convenient way to expose tools to a single agent is now shared infrastructure across teams and applications. It’s common to see multiple MCP servers running in parallel: some backing internal tools, others connecting to external services, and some dedicated to specific environments like development or production.&lt;/p&gt;

&lt;p&gt;As MCP adoption grows, teams also start centralizing discovery. Instead of hardcoding MCP endpoints, they publish servers and tools through an MCP registry so agents can dynamically discover what’s available. This makes MCP easier to adopt and scale, but it also changes how access is managed.&lt;/p&gt;

&lt;p&gt;At this stage, MCP starts to resemble other pieces of shared infrastructure. Multiple clients, agents, and workloads rely on the same servers and tools, often across organizational boundaries.&lt;/p&gt;

&lt;p&gt;This shift sets the stage for why MCP access control becomes necessary as it moves from experimentation to sustained use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why MCP access control matters now
&lt;/h2&gt;

&lt;p&gt;Teams are standing up MCP servers for internal services. Tools are getting published through registries so agents can discover them dynamically. Multiple users, applications, and environments begin connecting to the same MCP endpoints.&lt;/p&gt;

&lt;p&gt;In early, single-user setups, the risk of broad access is easy to ignore. Once MCP servers are shared and discoverable, access is no longer implicit. Connecting to an &lt;a href="https://portkey.ai/mcp-servers?ref=portkey.ai" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; means gaining visibility into the tools it exposes, and tools represent real capabilities: reading data, modifying systems, or triggering downstream actions.&lt;/p&gt;

&lt;p&gt;Without access control, any connected client or agent can see and attempt to use every exposed tool. In practice, this leads to over-permissioned agents, unclear ownership, and difficulty reasoning about who can do what.&lt;/p&gt;

&lt;p&gt;In production environments, the consequences are more serious: accidental writes, unintended data access, or misuse of high-cost tools.&lt;/p&gt;

&lt;p&gt;This is the point where MCP access control stops being optional. If MCP servers are treated as shared infrastructure, they need the same kind of access boundaries that other internal platforms rely on.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MCP access control actually means
&lt;/h2&gt;

&lt;p&gt;MCP access control is often mistaken for a simple authentication problem: once a client or agent is identified, access is assumed to be safe. In reality, authentication only answers one question: &lt;em&gt;who is connecting&lt;/em&gt;. Access control answers the harder one: &lt;strong&gt;what that client or agent is allowed to do once connected&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Authentication&lt;/strong&gt; identifies &lt;em&gt;who is connecting&lt;/em&gt; to an MCP server.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Authorization&lt;/strong&gt; defines &lt;em&gt;which MCP servers and tools a client or an agent is allowed to access&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://portkey.ai/blog/understanding-mcp-authorization" rel="noopener noreferrer"&gt;MCP authorization&lt;/a&gt; is the harder problem. A single MCP server can expose multiple tools with very different risk profiles. Some tools are read-only, others modify state or trigger workflows. Treating all authenticated clients as equally trusted ignores these differences.&lt;/p&gt;

&lt;p&gt;MCP server access control needs to operate beyond simple allow-or-deny at the connection level. It must define which tools are visible, which actions are permitted, and which are explicitly forbidden, often based on context like environment or workload.&lt;/p&gt;

&lt;p&gt;When handled correctly, access control becomes an explicit policy layer rather than scattered logic inside individual servers. That clarity is what makes enterprise MCP access control possible as usage continues to grow.&lt;/p&gt;

&lt;h3&gt;
  
  
  How AI agents change the access control model
&lt;/h3&gt;

&lt;p&gt;AI agents interact with MCP very differently from traditional API clients. Instead of calling a single known endpoint with a fixed purpose, agents explore. They discover available tools, reason about which ones might help, and invoke them in sequence based on intermediate outcomes.&lt;/p&gt;

&lt;p&gt;This behavior changes how access control needs to be designed. An agent may enumerate all exposed tools on an MCP server, attempt multiple variations of an action, or chain tools together in ways that weren’t explicitly anticipated.&lt;/p&gt;

&lt;p&gt;In MCP environments without strong access boundaries, agents tend to inherit more capability than they need. A tool exposed for internal maintenance might be visible to a customer-facing agent. Over time, access becomes implicit and difficult to reason about.&lt;/p&gt;

&lt;p&gt;This is why MCP access control for agents needs to be explicit and restrictive by default. Authorization decisions must account for the agent’s role, environment, and workload context, not just its identity.&lt;/p&gt;

&lt;h2&gt;
  
  
  What good MCP access control looks like
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Policy-driven authorization&lt;/strong&gt;
Access decisions are defined declaratively, not hardcoded inside MCP servers. Policies clearly specify who can connect, which tools are visible, and which actions are allowed or denied.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear separation of server-level and tool-level access&lt;/strong&gt;
Server access determines who can connect at all, while tool-level permissions restrict what can actually be invoked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Least-privilege by default&lt;/strong&gt;
Clients and agents only see the tools they are meant to use, reducing overexposure as MCP environments grow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit deny boundaries&lt;/strong&gt;
Sensitive, destructive, or high-cost tools are intentionally restricted, even as new servers or agents are added.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized and auditable rules&lt;/strong&gt;
Access intent is easy to review and reason about, without tracing through server code or scattered configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensible as MCP usage evolves&lt;/strong&gt;
New MCP servers, tools, and environments inherit consistent access behavior without reimplementing permission logic.&lt;/li&gt;
&lt;/ul&gt;
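
&lt;p&gt;A minimal sketch of what a declarative, default-deny policy check might look like — the policy shape, principal, and tool names here are hypothetical, not a real schema:&lt;/p&gt;

```python
# Illustrative declarative MCP access policy: explicit deny rules win, and
# anything not explicitly allowed stays invisible (least privilege by default).

POLICY = {
    "support-agent": {
        "allow": {"crm.read_ticket", "crm.search_customers"},
        "deny":  {"crm.delete_customer"},
    },
}

def is_allowed(principal, tool):
    rules = POLICY.get(principal, {"allow": set(), "deny": set()})
    if tool in rules["deny"]:
        return False                   # explicit deny boundary always wins
    return tool in rules["allow"]      # default-deny: unlisted tools are hidden
```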

&lt;h2&gt;
  
  
  Making MCP access control practical at scale
&lt;/h2&gt;

&lt;p&gt;As MCP becomes a shared layer across agents, tools, and teams, access control can’t live inside individual servers or evolve informally. It needs to be enforced consistently, audited easily, and extended as MCP usage grows.&lt;/p&gt;

&lt;p&gt;This is where a centralized control plane matters. Portkey’s &lt;a href="https://portkey.ai/features/mcp?ref=portkey.ai" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt; brings MCP servers under the same governance layer teams already use for AI model access. Instead of managing access rules separately for models and tools, platform teams can apply unified policies across both, with clear boundaries, environment awareness, and auditability built in.&lt;/p&gt;

&lt;p&gt;By treating MCP access control as part of your broader AI infrastructure, you avoid fragmented permissions, reduce operational overhead, and make MCP safe to run in production environments.&lt;/p&gt;

&lt;p&gt;If you’re starting to use MCP beyond local or experimental setups, this is the right point to introduce structured access control. To see how MCP gateway can help you govern MCP servers and tools as first-class infrastructure, &lt;a href="https://portkey.sh/blog?ref=portkey.ai" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with us!&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>mcpservers</category>
    </item>
    <item>
      <title>What is a virtual MCP server: Need, benefits, use cases</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Mon, 22 Dec 2025 13:55:09 +0000</pubDate>
      <link>https://forem.com/portkey/what-is-a-virtual-mcp-server-need-benefits-use-cases-3f6i</link>
      <guid>https://forem.com/portkey/what-is-a-virtual-mcp-server-need-benefits-use-cases-3f6i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83nci5pdyyddtnqqi83c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83nci5pdyyddtnqqi83c.png" alt="What is a virtual MCP server: Need, benefits, use cases" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MCP adoption inside organizations has moved quickly beyond single-developer setups.&lt;/p&gt;

&lt;p&gt;What started as a convenient way to connect an AI client to a handful of tools is now being used across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple internal teams&lt;/li&gt;
&lt;li&gt;Shared agent platforms&lt;/li&gt;
&lt;li&gt;SaaS and third-party MCP servers&lt;/li&gt;
&lt;li&gt;Central MCP registries and hubs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each team brings its own MCP servers. Each server exposes its own set of tools. And each client or agent ends up wiring directly into multiple MCP sources.&lt;/p&gt;

&lt;h2&gt;
  
  
  The need for a virtual MCP server
&lt;/h2&gt;

&lt;p&gt;When MCP servers multiply across teams and providers, the growing challenge becomes &lt;strong&gt;composition and provisioning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;MCP registries help you find available servers. But they don’t help you answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which tools should this agent actually see?&lt;/li&gt;
&lt;li&gt;How do we group tools that belong together?&lt;/li&gt;
&lt;li&gt;How do we provision MCP access consistently across teams?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;virtual MCP server&lt;/strong&gt; sits one layer above the MCP registry to solve this.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is a virtual MCP server?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A virtual MCP server lets you combine a curated set of MCP tools from multiple MCP servers behind a single endpoint.&lt;/p&gt;

&lt;p&gt;From the client’s point of view, there is just one MCP server to connect to.&lt;br&gt;&lt;br&gt;
Behind the scenes, that virtual server maps to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple underlying MCP servers&lt;/li&gt;
&lt;li&gt;A curated subset of tools from each&lt;/li&gt;
&lt;li&gt;A predefined structure that matches how teams and agents actually work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This changes how MCP is provisioned. Rather than wiring every client to many MCP servers, platform teams can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create virtual MCP servers per app, agent, or workflow&lt;/li&gt;
&lt;li&gt;Group related tools together once&lt;/li&gt;
&lt;li&gt;Reuse those groupings across environments and teams&lt;/li&gt;
&lt;/ul&gt;
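
&lt;p&gt;Conceptually, a virtual MCP server is just a named mapping from one endpoint to curated subsets of upstream tools. A hypothetical sketch, with made-up server and tool names:&lt;/p&gt;

```python
# Illustrative shape of a virtual MCP server: one client-facing name that
# resolves to a curated subset of tools drawn from several upstream servers.

VIRTUAL_SERVERS = {
    "support-workflow": [
        {"server": "crm-mcp",     "tools": ["read_ticket", "reply_ticket"]},
        {"server": "billing-mcp", "tools": ["get_invoice"]},
    ],
}

def list_tools(virtual_name):
    # The client sees one flat tool list; the mapping stays server-side.
    return [
        (entry["server"], tool)
        for entry in VIRTUAL_SERVERS[virtual_name]
        for tool in entry["tools"]
    ]
```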

&lt;h2&gt;
  
  
  Common use cases for virtual MCP servers
&lt;/h2&gt;

&lt;p&gt;Virtual MCP servers are typically adopted when MCP usage needs to be simplified, scoped, or standardized across teams and environments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task-specific tool bundles for agents&lt;/strong&gt;
Expose only the exact tools an agent needs, even when those tools come from multiple MCP servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj5d5gqwlaysutgykmef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj5d5gqwlaysutgykmef.png" alt="What is a virtual MCP server: Need, benefits, use cases" width="738" height="798"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplified onboarding for internal teams&lt;/strong&gt;
Give teams a single MCP endpoint with a curated tool set instead of asking them to wire multiple MCP servers manually.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment-specific MCP configurations&lt;/strong&gt;
Use different virtual MCP servers for development, staging, and production without changing client or agent code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safer consumption of third-party MCP servers&lt;/strong&gt;
Consume external or community MCP servers through a controlled layer that limits exposed tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable MCP interfaces despite upstream changes&lt;/strong&gt;
Pin tool definitions and MCP server versions to avoid breaking agents when source MCPs evolve.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Virtual MCP servers still need governance
&lt;/h2&gt;

&lt;p&gt;As virtual MCP servers become the primary way teams consume MCP, they naturally turn into a critical control point.&lt;/p&gt;

&lt;p&gt;They sit between agents and real systems, aggregating tools from internal and third-party MCP servers and exposing them through a single interface. At this layer, questions around access, visibility, and safety start to matter more than simple connectivity.&lt;/p&gt;

&lt;p&gt;Teams need to know which agents or applications can use a given virtual MCP server, how those tools are being invoked in practice, and whether usage is behaving as expected. Without governance, virtual MCP servers risk becoming another blind spot: powerful, but hard to reason about once they’re in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where platforms like Portkey come in
&lt;/h2&gt;

&lt;p&gt;Platforms like Portkey provide the infrastructure needed to operate virtual MCP servers safely at scale.&lt;/p&gt;

&lt;p&gt;Portkey’s &lt;a href="https://portkey.ai/features/mcp?ref=portkey.ai" rel="noopener noreferrer"&gt;MCP Gateway&lt;/a&gt; acts as a control plane for MCP usage, giving teams visibility into tool calls, latency, and failures while enforcing consistent access boundaries across environments. Instead of managing MCP composition in isolation, platform teams can treat virtual MCP servers as part of their broader AI platform, alongside models, routing, and observability.&lt;/p&gt;

&lt;p&gt;This makes it possible to roll out MCP across teams without losing control or adding operational overhead.&lt;/p&gt;

&lt;p&gt;If you’re already working with multiple MCP servers, the next step is treating MCP as platform infrastructure.&lt;/p&gt;

&lt;p&gt;To see how &lt;strong&gt;Portkey’s MCP Gateway&lt;/strong&gt; helps teams run and govern MCP servers, &lt;a href="https://portkey.sh/blog?ref=portkey.ai" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with us.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding MCP Authorization</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Thu, 18 Dec 2025 07:35:10 +0000</pubDate>
      <link>https://forem.com/portkey/understanding-mcp-authorization-334j</link>
      <guid>https://forem.com/portkey/understanding-mcp-authorization-334j</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetyxe5q4fhhz435u5u3z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetyxe5q4fhhz435u5u3z.png" alt="Understanding MCP Authorization" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MCP makes it possible for AI models and agents to interact with external tools, APIs, and data sources through a standardized interface. As MCP moves from local experimentation to shared servers, registries, and production deployments, one question becomes unavoidable: &lt;strong&gt;who is allowed to do what, and under which conditions?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;authorization&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why MCP needs authorization
&lt;/h2&gt;

&lt;p&gt;MCP introduces a powerful execution layer between AI clients and real systems. Once a client is connected to an MCP server, it can discover tools, invoke actions, and access data dynamically, often without hard-coded call paths or predefined workflows.&lt;/p&gt;

&lt;p&gt;That flexibility changes the security model.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP turns intent into capability
&lt;/h3&gt;

&lt;p&gt;In traditional applications, developers explicitly wire which APIs can be called and under what conditions. With MCP, an AI client can decide &lt;em&gt;at runtime&lt;/em&gt; which tools to invoke based on model outputs.&lt;/p&gt;

&lt;p&gt;Without authorization controls, this effectively means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every connected client can attempt to use every exposed tool&lt;/li&gt;
&lt;li&gt;Tool access is governed by availability, not permission&lt;/li&gt;
&lt;li&gt;Risk scales with the number of tools, not the number of users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Early MCP setups often relied on implicit trust. A local MCP server, a single user, and a small set of non-destructive tools can make broad access feel acceptable during experimentation. In these environments, the cost of a mistake is low, and security boundaries are often assumed rather than enforced.&lt;/p&gt;

&lt;p&gt;In shared development environments, internal platforms serving multiple teams, central MCP registries, or production agents with real side effects, implicit trust no longer holds. Simply knowing that a client connected to a server is not sufficient.&lt;/p&gt;

&lt;p&gt;At that point, systems must explicitly define which tools a client is allowed to invoke, which resources it can access, and which actions are prohibited altogether.&lt;/p&gt;

&lt;p&gt;AI agents further amplify authorization risk because they behave very differently from traditional API consumers. Rather than following fixed call paths, agents explore their environment. They may enumerate available tools, retry failed actions with modified inputs, or chain multiple tools together in ways the developer did not anticipate.&lt;/p&gt;

&lt;p&gt;This exploratory behavior is useful for capability discovery, but it also increases the likelihood of unintended outcomes.&lt;/p&gt;

&lt;p&gt;Without clear authorization boundaries, that exploration can easily cross into unsafe territory. Agents may access data they were not meant to see, perform unintended writes or deletions, or overuse high-privilege or high-cost tools.&lt;/p&gt;

&lt;p&gt;Authorization is what constrains this behavior, ensuring that even when agents explore, they do so within well-defined and enforceable limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  How MCP authorization works
&lt;/h2&gt;

&lt;p&gt;MCP authorization is built around a simple but strict principle: &lt;strong&gt;the server is the final authority&lt;/strong&gt;. Clients may discover capabilities and attempt tool calls, but it is always the MCP server that decides whether a request is allowed to proceed.&lt;/p&gt;

&lt;p&gt;Authorization is evaluated at request time, not assumed at connection time. This distinction matters because MCP clients, especially agents, can change behavior dynamically.&lt;/p&gt;

&lt;p&gt;In practice, authorization decisions are derived from identity and context. The server receives an authenticated request, extracts information about the caller (such as user, service, workspace, or environment), and evaluates that against its authorization rules. Those rules determine which tools are visible, which resources can be accessed, and which operations are explicitly denied.&lt;/p&gt;

&lt;p&gt;Unlike static API integrations, MCP authorization is not just about endpoint access. It often operates at a finer granularity: tool-level permissions, action-level constraints, and contextual limits based on metadata. This allows the same MCP server to safely serve multiple clients with different levels of access, without duplicating infrastructure or exposing unnecessary capabilities.&lt;/p&gt;

&lt;p&gt;Another important aspect is that authorization in MCP is continuous. Permission is not granted once and assumed forever. Each tool invocation can be independently authorized, logged, and audited. This makes it possible to revoke access, tighten policies, or introduce new constraints without redeploying clients or retraining agents.&lt;/p&gt;

&lt;p&gt;Taken together, MCP authorization acts as the control plane that sits between AI decision-making and real-world execution. It ensures that no matter how flexible or autonomous a client becomes, its actions remain bounded by explicit, enforceable rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authorization models and patterns in MCP
&lt;/h2&gt;

&lt;p&gt;MCP does not mandate a single authorization mechanism, but most real-world deployments converge on a few common patterns. These patterns are shaped by two constraints:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;MCP servers need strong control over tool execution.&lt;/li&gt;
&lt;li&gt;Clients, especially agents, must be able to operate without embedding long-lived secrets.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One widely adopted approach is &lt;strong&gt;token-based authorization&lt;/strong&gt;, often aligned with OAuth 2.1 semantics for HTTP-based MCP servers. In this model, a client authenticates through an identity provider and receives a short-lived access token. That token encodes scopes or claims describing what the client is allowed to do. Each MCP request carries the token, and the server evaluates it before allowing any tool invocation.&lt;/p&gt;

&lt;p&gt;Another common pattern is &lt;strong&gt;scoped capability access&lt;/strong&gt;. Instead of granting blanket access to an MCP server, tokens or credentials are limited to a specific subset of tools or actions. For example, a client may be allowed to call read-only tools but not tools that mutate data or trigger external side effects.&lt;/p&gt;

&lt;p&gt;In multi-tenant or platform scenarios, &lt;strong&gt;role-based and attribute-based authorization&lt;/strong&gt; becomes important. Rather than authorizing individual clients one by one, servers define roles or policies tied to attributes such as workspace, environment, team, or workload type. When a request arrives, the server evaluates these attributes and applies the appropriate policy.&lt;/p&gt;

&lt;p&gt;Some deployments also separate &lt;strong&gt;authentication from execution authority&lt;/strong&gt;. A client may authenticate as a known identity, but receive a more limited execution token when interacting with specific MCP servers. This avoids passing high-privilege credentials downstream and ensures that even trusted clients operate under constrained permissions when invoking tools.&lt;/p&gt;
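
&lt;p&gt;A minimal sketch of scope-based, per-request authorization, assuming the token’s claims (scopes and expiry) have already been verified upstream. The claim and scope names are illustrative, not a fixed MCP schema:&lt;/p&gt;

```python
# Illustrative per-request check: a short-lived token must be unexpired and
# carry the scope a tool requires; mutating tools demand a stricter scope.

import time

TOOL_SCOPES = {
    "read_file":  "fs:read",
    "write_file": "fs:write",   # side-effecting tool needs a stricter scope
}

def authorize(token, tool):
    if time.time() >= token["exp"]:
        return False                      # short-lived token has expired
    required = TOOL_SCOPES.get(tool)
    if required is None:
        return False                      # unknown tools are denied outright
    return required in token["scopes"]
```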

&lt;h2&gt;
  
  
  Best practices for MCP authorization
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apply least privilege by default&lt;/strong&gt;
Expose only the tools and actions a client truly needs. High-impact or side-effecting tools should always require stricter permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use short-lived, scoped tokens&lt;/strong&gt;
Avoid long-lived secrets. Expiring tokens with explicit scopes reduce blast radius and support easy revocation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authorize using context, not just identity&lt;/strong&gt;
Enforce policies based on workspace, environment, tenant, workload type, and request metadata—not just who the client is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforce authorization on every tool call&lt;/strong&gt;
Do not rely on connect-time checks. Evaluate permissions per invocation to contain agent exploration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delegate access safely&lt;/strong&gt;
Mint constrained execution tokens for agents instead of passing upstream credentials through them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make authorization auditable&lt;/strong&gt;
Log who invoked which tool, under what policy, and why the request was allowed or denied.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;MCP makes it easy for AI clients and agents to interact with real systems, but that flexibility comes with risk. Authorization is what turns MCP from a powerful abstraction into a production-ready interface.&lt;/p&gt;

&lt;p&gt;Strong MCP authorization means enforcing clear, contextual permissions at the server boundary, so tools are accessible without being overexposed, and agents can operate without inheriting excessive privilege. This is especially important as MCP moves into shared platforms, registries, and production workloads.&lt;/p&gt;

&lt;p&gt;Platforms like Portkey’s &lt;a href="https://portkey.ai/features/mcp?ref=portkey.ai" rel="noopener noreferrer"&gt;MCP Gateway&lt;/a&gt; build these controls in by default, combining MCP connectivity with scoped access, centralized policy enforcement, and full auditability.&lt;/p&gt;

&lt;p&gt;The result is an MCP setup that supports experimentation and autonomy, while still meeting the security and governance expectations of real-world deployments. If you're looking to get started with MCP in your organization, &lt;a href="https://portkey.sh/blog?ref=portkey.ai" rel="noopener noreferrer"&gt;connect with Portkey&lt;/a&gt; today!&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>mcpservers</category>
    </item>
    <item>
      <title>LLM routing techniques for high-volume applications</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Fri, 05 Dec 2025 12:31:01 +0000</pubDate>
      <link>https://forem.com/portkey/llm-routing-techniques-for-high-volume-applications-ljo</link>
      <guid>https://forem.com/portkey/llm-routing-techniques-for-high-volume-applications-ljo</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo85rciow03vkdf4jo0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo85rciow03vkdf4jo0v.png" alt="LLM routing techniques for high-volume applications" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;High-volume AI applications can't depend on a single model or provider.&lt;/p&gt;

&lt;p&gt;They need the ability to shift traffic as conditions change, and routing is the control layer that keeps applications stable under load. For teams serving millions of requests, this flexibility is the difference between smooth operation and unpredictable outages.&lt;/p&gt;

&lt;p&gt;In this blog, we'll discuss different LLM routing techniques and how you can apply them to your AI workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How routing differs from simple model selection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most teams begin with static model selection: pick one model for a task, wire the endpoint, and ship. This works at small scale but breaks down quickly under real production traffic. High-volume systems need something more adaptive, and that’s where LLM routing techniques come in.&lt;/p&gt;

&lt;p&gt;Model selection is a one-time architectural decision. Routing is continuous decision-making. Instead of binding every request to a predetermined model, you can evaluate each request against real-time signals and choose the best model for that moment.&lt;/p&gt;

&lt;p&gt;Static selection assumes the environment is stable. Routing assumes things will vary. And at scale, they always do. Traffic surges, providers throttle, latency drifts regionally, or a newer model becomes more cost-efficient. Routing absorbs this volatility so applications don’t have to.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Challenges in high-volume workloads that routing solves&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As applications scale, the behavior of LLMs becomes increasingly uneven. Providers operate under fluctuating load, pricing structures shift, and latency patterns vary across regions and times of day. LLM routing techniques exist precisely to handle these challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Unpredictable latency variance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Even the best models can swing from 200ms to 800ms. When every millisecond matters—search, chat, support automation—these spikes degrade user experience. Routing helps maintain consistency by choosing the fastest healthy endpoint in real time.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Rate limits and throttling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Providers impose hard and soft limits that are easy to hit during traffic surges. Retry storms make this worse. Effective routing distributes load across models or providers to avoid saturating any single endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cost instability at scale&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;What seems affordable at 10,000 requests becomes unsustainable at 10 million. Always using frontier models creates unnecessary spend for routine or low-sensitivity tasks. Routing enables cheaper paths without compromising output quality where it matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Provider degradation or outages&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Models occasionally slow down, return elevated error rates, or fail entirely. Without routing, these incidents become application-wide outages. With routing, systems can detect degradation and shift traffic within seconds.&lt;/p&gt;

&lt;p&gt;High-volume environments make these challenges unavoidable, but with the right LLM routing techniques they become manageable, predictable, and often invisible to users.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM routing techniques
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Conditional routing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Latency-based routing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Some applications need consistently fast responses—search, support chat, and consumer-facing assistants. Latency-based routing directs traffic to the model or provider that historically responds the fastest for the given region or workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-based routing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
High-volume workloads amplify every pricing difference. Instead of sending all requests to a frontier model, cost-based routing classifies traffic and steers low-sensitivity or repetitive tasks to cheaper models.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Content rewrites → economical models&lt;/li&gt;
&lt;li&gt;Complex planning → high-accuracy models&lt;/li&gt;
&lt;li&gt;Internal automation → mid-tier models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the most impactful LLM routing techniques for keeping spend predictable at scale.&lt;/p&gt;
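&lt;p&gt;The task-to-tier mapping above can be sketched as a small lookup table. In practice the classification step might be rule-based or model-based; the tier names below are hypothetical:&lt;/p&gt;

```python
# Hypothetical task-to-tier table for cost-based routing.
COST_TIERS = {
    "rewrite": "economy-model",      # content rewrites -> economical models
    "planning": "frontier-model",    # complex planning -> high-accuracy models
    "automation": "mid-tier-model",  # internal automation -> mid-tier models
}

def route_by_cost(task_type: str) -> str:
    # Default to the mid tier when the task type is unknown.
    return COST_TIERS.get(task_type, "mid-tier-model")

print(route_by_cost("rewrite"))  # economy-model
```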

&lt;p&gt;&lt;strong&gt;Region-aware routing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Routing requests to the closest deployment or provider region improves latency and reduces cross-border data movement. For regulated industries or global applications, &lt;a href="https://portkey.ai/blog/geo-location-based-llm-routing" rel="noopener noreferrer"&gt;geo-location based routing&lt;/a&gt; is essential for both speed and compliance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic routing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Semantic routing classifies the intent of the request and routes it to the model best suited for that task. This can be embedding-based, rule-based, or classifier-driven.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coding tasks → code-optimized models&lt;/li&gt;
&lt;li&gt;Summaries → small, fast models&lt;/li&gt;
&lt;li&gt;Reasoning tasks → frontier models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach dramatically improves quality-to-cost ratio.&lt;/p&gt;
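&lt;p&gt;As a minimal illustration of the classify-then-route step, here is a keyword-based intent router. A production system would typically use embeddings or a trained classifier instead; the keywords and model names are assumptions for the sketch:&lt;/p&gt;

```python
# Minimal intent classifier for semantic routing. Keyword rules are the
# simplest illustration; real systems often use embeddings or classifiers.

ROUTES = [
    (("def ", "class ", "traceback", "compile"), "code-model"),
    (("summarize", "tl;dr", "shorten"),          "small-fast-model"),
]

def semantic_route(prompt: str) -> str:
    text = prompt.lower()
    for keywords, model in ROUTES:
        if any(k in text for k in keywords):
            return model
    return "frontier-model"  # reasoning tasks and everything else

print(semantic_route("Summarize this meeting transcript"))  # small-fast-model
```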

&lt;p&gt;&lt;strong&gt;Metadata-based routing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Traffic often varies by team, tier, or workload type. Metadata-based routing uses contextual information to determine which model a request should hit.&lt;/p&gt;

&lt;p&gt;This allows organizations to enforce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;different cost ceilings per team&lt;/li&gt;
&lt;li&gt;stricter models for sensitive workloads&lt;/li&gt;
&lt;li&gt;lightweight models for experimentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Metadata becomes a simple but powerful dimension in LLM routing techniques, especially inside large organizations.&lt;/p&gt;
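&lt;p&gt;A metadata policy like this reduces to a lookup keyed on request context. The teams, environments, and model names below are illustrative, not a real policy:&lt;/p&gt;

```python
# Metadata-based routing sketch: the request's team/env metadata decides
# the route. The policy table and model names are hypothetical.

POLICY = {
    ("research", "experiment"): "small-open-model",   # cheap experimentation
    ("support",  "production"): "mid-tier-model",
    ("legal",    "production"): "frontier-model",     # sensitive workloads
}

def route_by_metadata(meta: dict) -> str:
    key = (meta.get("team"), meta.get("env"))
    return POLICY.get(key, "mid-tier-model")          # safe default

print(route_by_metadata({"team": "legal", "env": "production"}))  # frontier-model
```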

&lt;h3&gt;
  
  
  &lt;strong&gt;Performance-based routing techniques&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;While conditional routing relies on predefined rules, performance-based routing reacts to what’s happening &lt;em&gt;right now&lt;/em&gt;. These &lt;strong&gt;LLM routing techniques&lt;/strong&gt; use real-time signals like latency, error rates, throughput, and health checks to make decisions dynamically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load-based routing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
LLM providers often enforce strict rate limits. When traffic spikes, a single endpoint can saturate quickly, leading to throttling or elevated error rates. Load-based routing allows you to distribute requests across multiple models or providers to keep every endpoint below its capacity threshold.&lt;/p&gt;

&lt;p&gt;This is especially important for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;consumer-scale applications with bursty traffic&lt;/li&gt;
&lt;li&gt;agent workloads generating many parallel requests&lt;/li&gt;
&lt;li&gt;batch processing pipelines with high concurrency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By smoothing load across the system, it prevents cascading failures and maintains consistent throughput.&lt;/p&gt;
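&lt;p&gt;One simple way to smooth load is to send each request to the endpoint with the most remaining headroom. This sketch tracks in-flight requests against assumed per-endpoint capacities; the provider names and limits are illustrative:&lt;/p&gt;

```python
# Load-based routing sketch: pick the endpoint with the most free capacity
# against its (hypothetical) requests-per-minute limit.

class LoadBalancer:
    def __init__(self, limits):
        self.limits = limits                      # endpoint -> rpm capacity
        self.in_flight = {e: 0 for e in limits}

    def pick(self):
        # Free capacity = capacity minus requests currently in flight.
        endpoint = max(self.limits,
                       key=lambda e: self.limits[e] - self.in_flight[e])
        self.in_flight[endpoint] += 1
        return endpoint

    def done(self, endpoint):
        self.in_flight[endpoint] -= 1

lb = LoadBalancer({"provider-a": 100, "provider-b": 60})
picks = [lb.pick() for _ in range(3)]
print(picks)  # provider-a absorbs traffic while it has more headroom
```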

&lt;p&gt;&lt;strong&gt;Fallback routing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Providers can degrade without warning: latency spikes, error rates climb, or throughput collapses. Fallback routing ensures that when a primary model fails or slows down, traffic automatically switches to a secondary or tertiary model.&lt;/p&gt;

&lt;p&gt;Fallback chains are critical for uptime-sensitive workloads because they remove the single point of failure inherent in relying on any one provider.&lt;/p&gt;
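&lt;p&gt;A fallback chain is essentially "try in priority order, move on when a call fails." In this sketch, &lt;code&gt;call_model&lt;/code&gt; stands in for a real SDK call and deliberately fails for the primary to show the handover; the model names are illustrative:&lt;/p&gt;

```python
# Fallback-chain sketch: try providers in priority order, moving on when a
# call raises. `call_model` is a stand-in that simulates a degraded primary.

def call_model(model, prompt):
    if model == "primary-model":
        raise TimeoutError("upstream timeout")    # simulated degradation
    return f"{model}: ok"

def with_fallbacks(prompt, chain):
    last_error = None
    for model in chain:
        try:
            return call_model(model, prompt)
        except Exception as err:
            last_error = err                      # record and try the next model
    raise RuntimeError("all providers failed") from last_error

print(with_fallbacks("hi", ["primary-model", "secondary-model"]))
# secondary-model: ok
```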

&lt;h3&gt;
  
  
  &lt;strong&gt;Circuit breakers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Circuit breakers temporarily “trip” a failing route to protect the system. When error rates or timeouts cross a threshold, the route is disabled for a cooldown period while traffic is diverted elsewhere.&lt;/p&gt;

&lt;p&gt;This prevents the system from repeatedly calling a degraded endpoint, avoids retry storms, and stabilizes behavior during provider outages or maintenance windows.&lt;/p&gt;
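&lt;p&gt;The trip-and-cooldown behavior can be captured in a few lines. This is a simplified single-route breaker; the threshold and cooldown values are illustrative, and production breakers usually add a half-open probe state with limited trial traffic:&lt;/p&gt;

```python
# Circuit-breaker sketch: after `threshold` consecutive failures the route
# is tripped for `cooldown` seconds and callers should route elsewhere.

import time

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at, self.failures = None, 0   # cooldown over: retry
            return True
        return False

    def record(self, success):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()     # trip the breaker

cb = CircuitBreaker(threshold=2, cooldown=60)
cb.record(False)
cb.record(False)
print(cb.allow())  # False: route is tripped, divert traffic elsewhere
```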

&lt;h3&gt;
  
  
  &lt;strong&gt;Canary testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before adopting a new model or upgrading to a new version, canary routing sends a small percentage of traffic—1–5%—to the candidate model. Teams can observe quality, latency, and cost metrics in production before rolling out fully.&lt;/p&gt;

&lt;p&gt;This is one of the safest LLM routing techniques for introducing new models without risking production regressions.&lt;/p&gt;
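&lt;p&gt;Canary splits are usually made deterministic by hashing a stable request or user ID, so the same caller always lands on the same side of the split. A sketch, with the percentage and model names as assumptions:&lt;/p&gt;

```python
# Canary-routing sketch: hash each request id so a stable ~2% slice of
# traffic goes to the candidate model. Percentage and names are illustrative.

import hashlib

def canary_route(request_id: str, canary_pct: float = 2.0) -> str:
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = digest[0] / 255 * 100        # map first byte to a 0-100 bucket
    return "candidate-model" if bucket < canary_pct else "stable-model"

routed = [canary_route(f"req-{i}") for i in range(1000)]
print(routed.count("candidate-model"))    # roughly 2% of 1000 requests
```

&lt;p&gt;Because the split is keyed on the ID rather than random per call, quality comparisons between the canary and stable populations stay consistent across retries.&lt;/p&gt;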

&lt;h1&gt;
  
  
  &lt;strong&gt;Observability requirements for effective routing&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Routing only works as well as the signals you feed into it. High-volume systems need continuous visibility into model behavior so that routing decisions remain accurate, timely, and cost-aware. Without &lt;a href="https://portkey.ai/features/observability?ref=portkey.ai" rel="noopener noreferrer"&gt;LLM observability&lt;/a&gt;, even the most advanced LLM routing techniques become guesswork.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Real-time latency and throughput metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Latency isn’t static. Providers fluctuate throughout the day, and different regions behave differently. Routing engines need a live view of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p50 / p95 latency&lt;/li&gt;
&lt;li&gt;token throughput (tokens/sec)&lt;/li&gt;
&lt;li&gt;variance across regions and providers&lt;/li&gt;
&lt;/ul&gt;
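&lt;p&gt;Computing these signals from a sliding window of recent request timings is straightforward with the standard library. The sample window below is made up for illustration:&lt;/p&gt;

```python
# Latency signals a router consumes: p50/p95 over a sliding window of
# recent request timings in milliseconds (sample data is illustrative).

from statistics import quantiles

def latency_summary(samples_ms):
    # quantiles(n=20) yields 19 cut points; index 9 is p50, index 18 is p95.
    q = quantiles(samples_ms, n=20)
    return {"p50": q[9], "p95": q[18]}

window = [200, 210, 250, 240, 230, 800, 220, 260, 215, 225,
          205, 245, 235, 255, 212, 228, 242, 218, 232, 238]
print(latency_summary(window))  # the 800ms outlier inflates p95, not p50
```

&lt;p&gt;Note how a single 800ms outlier moves p95 sharply while leaving p50 almost untouched, which is why routing engines watch both.&lt;/p&gt;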

&lt;h3&gt;
  
  
  &lt;strong&gt;Error rates and provider health&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Errors tend to spike before full outages. Tracking failure patterns helps performance-based routing decide when to apply fallbacks or trip circuit breakers. Key indicators include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timeout frequency&lt;/li&gt;
&lt;li&gt;429 and 5xx spikes&lt;/li&gt;
&lt;li&gt;connection or rate-limit failures&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cost and token usage visibility&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Routing decisions, especially cost-based ones, depend on knowing token consumption and effective cost per request. High-volume teams need granular insight into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;token usage per model&lt;/li&gt;
&lt;li&gt;cost per route&lt;/li&gt;
&lt;li&gt;cost drift over time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Model comparison insights&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To refine routing rules, teams need to know how models actually perform under real workloads. Observability should surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;quality deltas across models&lt;/li&gt;
&lt;li&gt;model regressions after version changes&lt;/li&gt;
&lt;li&gt;output differences for similar inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Feedback loops&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The best routing setups continuously learn. Observability closes the loop between offline evaluation and online performance, allowing teams to adjust conditions, thresholds, and fallback priorities as the system evolves.&lt;/p&gt;

&lt;p&gt;With proper observability, LLM routing techniques become predictable, measurable, and aligned with real user experience.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;How Portkey supports advanced LLM routing at scale&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;High-volume teams need routing that adapts instantly to shifting latency, cost, or provider performance. Portkey was built as an &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;AI Gateway&lt;/a&gt; to give teams this exact control layer without building it themselves. It brings both conditional and performance-based LLM routing techniques into a single, production-grade system.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Multi-provider routing with one API&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Portkey connects to 1,600+ models across leading providers. This makes routing decisions fully transparent and manageable through one unified interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Dynamic per-request routing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Routing rules can be applied using real-time context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;budget ceilings or cost thresholds&lt;/li&gt;
&lt;li&gt;metadata (team, workload, risk level)&lt;/li&gt;
&lt;li&gt;rate limits&lt;/li&gt;
&lt;li&gt;guardrail checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means each request can take the most efficient path without manual orchestration.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Built-in performance protection: fallbacks, circuit breakers, and load management&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Portkey continuously monitors provider health and automatically redirects traffic when a model slows, errors spike, or rate limits are reached. This protects applications from upstream volatility and prevents outages from spreading.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Budgets, guardrails, and semantic caching for cost control&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;High-volume workloads often face cost drift. Portkey pairs routing with budget policies, guardrails, and semantic caching so teams can keep spend predictable even as usage scales.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;End-to-end observability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Latency, token usage, errors, cost, and routing decisions are all captured automatically. This visibility helps teams tune routing strategies, evaluate model performance, and understand how decisions impact reliability and cost.&lt;/p&gt;

&lt;p&gt;If you want to adopt intelligent, multi-provider LLM routing techniques without building the infrastructure yourself, Portkey’s AI Gateway gives you a ready-to-use routing layer that scales reliably with your traffic.&lt;/p&gt;

&lt;p&gt;If you'd like to apply these LLM routing techniques across providers, you can &lt;a href="https://portkey.sh/blogs?ref=portkey.ai" rel="noopener noreferrer"&gt;try it&lt;/a&gt; out with Portkey or &lt;a href="https://portkey.sh/blog?ref=portkey.ai" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with the team to learn more.&lt;/p&gt;

</description>
      <category>aigateway</category>
    </item>
    <item>
      <title>LLM access control in multi-provider environments</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Wed, 03 Dec 2025 08:01:21 +0000</pubDate>
      <link>https://forem.com/portkey/llm-access-control-in-multi-provider-environments-2m9i</link>
      <guid>https://forem.com/portkey/llm-access-control-in-multi-provider-environments-2m9i</guid>
      <description>&lt;p&gt;Most organizations have already adopted a mix of AI providers and open source models. This flexibility has become the norm, but it also brings a new governance challenge: every provider comes with different tokens, permissions, limits, and safety settings.&lt;/p&gt;

&lt;p&gt;As adoption spreads across departments, organizations need a defined access control layer. Strong LLM access control brings order to this complexity, keeping usage safe, predictable, and aligned with institutional standards while still supporting experimentation and development.&lt;/p&gt;

&lt;h2&gt;
  
  
  What LLM access control means in a multi-provider world
&lt;/h2&gt;

&lt;p&gt;LLM access control is the set of policies and permissions that determine &lt;strong&gt;who&lt;/strong&gt; can use &lt;strong&gt;which&lt;/strong&gt; models, &lt;strong&gt;under what conditions&lt;/strong&gt;, and &lt;strong&gt;with what safeguards&lt;/strong&gt;. In a single-provider setup, this is often limited to API keys and basic permissions. But in a multi-provider environment, access control becomes far more layered.&lt;/p&gt;

&lt;p&gt;Different providers expose different capabilities, cost structures, context limits, and safety defaults. Teams also bring their own tools: agents, notebooks, apps, and integrations, each with its own access patterns. Without a consistent control layer, every team ends up managing permissions differently, which leads to policy drift, duplicated credentials, and uneven security across the organization.&lt;/p&gt;

&lt;p&gt;In this environment, access control must span several levels at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provider-level access&lt;/strong&gt;: deciding who can use OpenAI, Anthropic, Azure, Google, or local models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Account and subscription access&lt;/strong&gt;: handling enterprise subscriptions, project accounts, and shared keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model-level access&lt;/strong&gt;: allowing or blocking specific models depending on role, risk, or cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User and team-level permissions&lt;/strong&gt;: mapping people and departments to the right capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application-level enforcement&lt;/strong&gt;: ensuring internal tools, agents, and services follow the same policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these levels form the baseline for predictable, governed LLM usage across providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Roles and permissions: who can access what
&lt;/h2&gt;

&lt;p&gt;Role-based access control (RBAC) is the foundation of LLM access control. It ensures that people only use the models, providers, and capabilities that match their responsibilities and that these permissions stay consistent across every AI provider and internal tool.&lt;/p&gt;

&lt;p&gt;At the top level, &lt;strong&gt;organization administrators&lt;/strong&gt; manage global access. They decide which providers are available, what enterprise accounts are connected, and which high-cost or sensitive models need restricted access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project or department owners&lt;/strong&gt; handle access within their own scope. They assign roles to team members, choose which models support their work, and apply local policies that fit their department’s goals. This matters especially in research and education environments where different groups operate with different levels of risk tolerance and funding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;End-users&lt;/strong&gt;, such as developers, researchers, analysts, or students, inherit access based on their role. Their permissions determine which models they can call, how much they can consume, and the guardrails applied to their requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://portkey.ai/for/rbac?ref=portkey.ai" rel="noopener noreferrer"&gt;RBAC&lt;/a&gt; becomes far more important in multi-provider setups. Access to GPT-4.1, Claude Opus, Gemini 1.5 Pro, or a specialized open-source model may be limited to specific teams. Some users may only work with smaller, lower-cost models. These distinctions keep usage predictable and prevent unnecessary spend or unintended exposure of sensitive data.&lt;/p&gt;

&lt;p&gt;Identity systems like SSO, SCIM, and directory groups tie everything together. When a user joins, leaves, or changes roles, their LLM permissions follow automatically without anyone needing to manage individual API keys.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budgets and consumption controls
&lt;/h2&gt;

&lt;p&gt;LLM usage can grow faster than expected, especially when multiple teams work across different providers. Budgets and consumption controls keep that usage predictable. They define how much a department, project, or individual user can spend, and they ensure that consumption doesn’t outpace governance.&lt;/p&gt;

&lt;p&gt;Budgets often operate at several levels. Organizations can set department-wide allocations, while project owners can apply more specific limits that match their workloads. Individual users may have personal quotas to prevent runaway scripts, aggressive experimentation, or unexpected spikes.&lt;/p&gt;

&lt;p&gt;Administrators can also apply &lt;a href="https://portkey.ai/docs/product/enterprise-offering/budget-policies?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;strong&gt;usage limit policies&lt;/strong&gt;&lt;/a&gt;, which add flexibility beyond static budgets. These policies enforce limits based on dynamic conditions such as API keys, metadata, workspace, environment, or user role.&lt;/p&gt;

&lt;p&gt;Multi-provider setups increase the importance of these controls. Each provider has different pricing structures, context window sizes, and token costs. A model that feels affordable for early experimentation may become expensive as workloads grow. Consumption controls help teams choose the right models for the right jobs and prevent hidden costs from accumulating.&lt;/p&gt;

&lt;p&gt;Real-time tracking and enforcement complete the picture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvrp1ad3axqnz3rggzth.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvrp1ad3axqnz3rggzth.png" width="686" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a user or team approaches a limit, the system can automatically route to lower-cost or lower-capacity models. This ensures consistent access while keeping spending aligned with organizational expectations.&lt;/p&gt;
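&lt;p&gt;The downgrade-as-spend-grows behavior can be sketched in a few lines. The thresholds and model names here are assumptions for illustration, not fixed policy values:&lt;/p&gt;

```python
# Budget-aware downgrade sketch: as spend approaches a team's ceiling,
# route to progressively cheaper models. Thresholds/names are illustrative.

def pick_model(spent: float, budget: float) -> str:
    used = spent / budget
    if used < 0.70:
        return "frontier-model"
    if used < 0.95:
        return "mid-tier-model"       # soft limit: downgrade, keep serving
    return "small-economy-model"      # near the cap: cheapest viable path

print(pick_model(spent=80.0, budget=100.0))  # mid-tier-model
```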

&lt;h1&gt;
  
  
  &lt;strong&gt;Rate limits and operational safeguards&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Rate limits are another essential part of LLM access control. They determine how frequently users or applications can call a model and help keep workloads stable across different teams and providers. Without rate limits, a single script, agent loop, or high-volume application can saturate capacity, causing slowdowns or blocking access for others.&lt;/p&gt;

&lt;p&gt;Rate limits operate at multiple layers. Providers impose their own limits per account or per model. Organizations then apply internal limits to make sure usage is distributed fairly across departments and that production workloads aren’t affected by experimentation or testing. These internal limits often vary by user role, environment, or model sensitivity.&lt;/p&gt;
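&lt;p&gt;Internal limits of this kind are commonly implemented as a token bucket: each request consumes a token, and tokens refill at a steady rate. The capacity and refill numbers below are illustrative:&lt;/p&gt;

```python
# Token-bucket sketch for an internal per-team rate limit layered on top of
# provider limits. Capacity and refill rate are hypothetical.

import time

class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def try_acquire(self, cost=1.0):
        now = time.monotonic()
        # Top up tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False        # caller should queue, throttle, or reroute

bucket = TokenBucket(capacity=2, refill_per_sec=1)
print([bucket.try_acquire() for _ in range(3)])  # [True, True, False]
```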

&lt;p&gt;Operational safeguards build on these limits by automating responses when the system is under load. For example, requests can be queued, throttled, or automatically routed by an &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;AI gateway&lt;/a&gt; to alternate providers when one endpoint is saturated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodfweocsiwg3y5halcoy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodfweocsiwg3y5halcoy.png" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These safeguards keep critical applications responsive and protect shared enterprise accounts from unpredictable demand.&lt;/p&gt;

&lt;p&gt;In multi-provider environments, rate limits also help balance traffic across different models and clouds. This prevents accidental over-reliance on a single provider and ensures continuity even when a provider enforces sudden throttling or undergoes downtime.&lt;/p&gt;

&lt;p&gt;Together, rate limits and safeguards provide strong LLM access control and the operational stability needed to support both high-volume production use cases and day-to-day exploration across an organization.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Guardrails as part of LLM access control&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Guardrails extend access control beyond “who can call what” to “what is allowed within those calls.” They create an operational safety layer that shapes the inputs, outputs, and behaviors of LLMs, ensuring that usage stays within institutional policies even when users have broad access.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwnt498ajfa81etzimu8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwnt498ajfa81etzimu8.png" width="588" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the input level, guardrails can enforce data rules. Sensitive information like PII, PHI, or research data can be redacted, masked, or blocked before reaching a model. This matters in environments where students, researchers, or distributed teams work with diverse datasets, not all of which should be exposed to every provider.&lt;/p&gt;
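&lt;p&gt;As a minimal illustration of input-level redaction, here is a regex-based sketch. Real guardrails use far richer detectors; these patterns only catch simple email and SSN-like shapes:&lt;/p&gt;

```python
# Input-guardrail sketch: redact obvious PII patterns before a prompt
# leaves the gateway. Patterns are deliberately simplistic illustrations.

import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(prompt: str) -> str:
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt

print(redact("Contact jane.doe@example.com, SSN 123-45-6789"))
# Contact [EMAIL], SSN [SSN]
```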

&lt;p&gt;Output guardrails help maintain policy alignment. They can filter or transform responses that violate content guidelines, prevent potentially harmful instructions, or restrict the model from generating certain types of information. This keeps results consistent with institutional standards and reduces manual moderation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdl6af9ls8y6nr8dqk4is.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdl6af9ls8y6nr8dqk4is.png" width="588" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Guardrails also apply to &lt;strong&gt;capabilities&lt;/strong&gt;, especially in agentic systems. Administrators can allow or deny access to specific tools, APIs, or file operations, and constrain what an agent is permitted to execute. This is critical when multiple teams build custom workflows, automation, or research assistants.&lt;/p&gt;

&lt;p&gt;In multi-provider environments, guardrails ensure consistency. Different providers offer varying safety settings and defaults. A central guardrail layer enforces the same policies across all models regardless of their vendor, size, or capabilities, so that teams don’t need to manage separate rule sets.&lt;/p&gt;

&lt;p&gt;By combining access rules, consumption controls, and guardrails, organizations can give users more flexibility without losing oversight or exposing themselves to unnecessary risk.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;What good LLM access control looks like&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Effective LLM access control is simple to understand but difficult to implement without a unified layer. At its core, a well-designed system includes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clear identities and roles&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Every user is mapped through SSO, directory groups, or SCIM, and roles determine which models and providers they can use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistent model and provider permissions&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
No shadow keys, no mismatched configurations. Access rules apply uniformly across OpenAI, Anthropic, Azure, Google, and any internal or open-source models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budgets and usage limits that reflect real needs&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Teams have predictable consumption, with dynamic limits that adjust for roles, environments, and metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limits that protect shared workloads&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Critical applications stay stable even during spikes, experimentation, or heavy agent activity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails that shape allowed behavior&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Both inputs and outputs follow institutional guidelines, and agent tools are restricted to what’s safe and necessary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time enforcement&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Policies activate at the moment of the request, not after the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unified visibility and auditability&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://portkey.ai/blog/the-complete-guide-to-llm-observability/" rel="noopener noreferrer"&gt;LLM observability&lt;/a&gt; lets teams monitor usage, policy hits, and provider performance in one place, without stitching together logs from multiple clouds.&lt;/p&gt;

&lt;p&gt;When these elements come together, organizations get a controlled, predictable, and safe AI environment without blocking innovation or slowing teams down.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;How Portkey helps&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Portkey's &lt;a href="https://portkey.ai/blog/what-is-an-llm-gateway/" rel="noopener noreferrer"&gt;AI Gateway&lt;/a&gt; brings all of these controls into a single platform that works across every model and provider. You can set roles, model permissions, budgets, rate limits, and guardrails once, and Portkey enforces them consistently across OpenAI, Anthropic, Azure, Google, and on-prem models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyil6wnui90uct8f9le3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyil6wnui90uct8f9le3.png" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Policies operate in real time, so overspend, policy violations, and unsafe requests are caught before they reach a model.&lt;/p&gt;

&lt;p&gt;Portkey also centralizes identity, usage visibility, audit logs, and provider monitoring. Teams get a governed environment they can rely on, while still having the freedom to experiment, evaluate new models, and run production workloads without managing scattered keys or inconsistent policies.&lt;/p&gt;

&lt;p&gt;If you’d like to implement LLM access control in your organization, you can &lt;a href="https://portkey.sh/blog?ref=portkey.ai" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with our team.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>aigateway</category>
    </item>
    <item>
      <title>Buyer’s guide to LLM observability tools</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Thu, 27 Nov 2025 12:18:51 +0000</pubDate>
      <link>https://forem.com/portkey/buyers-guide-to-llm-observability-tools-29kk</link>
      <guid>https://forem.com/portkey/buyers-guide-to-llm-observability-tools-29kk</guid>
      <description>&lt;p&gt;Observability for LLM systems has evolved from a debugging utility to a business function. When AI applications scale from prototypes to production, every model call represents real cost, latency, and reliability risk. Teams now need more than basic logs, they need full visibility into how models perform, drift, and behave in real-world conditions.&lt;/p&gt;

&lt;p&gt;But with this shift comes a question every AI platform team faces sooner or later:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Should we build our own observability stack or buy a specialized tool?&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;The visibility gap and what LLM observability should cover&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;While LLM adoption has surged, running LLM applications in production has proven more challenging than running traditional ML applications. The difficulty arises from massive model sizes, intricate architectures, and non-deterministic outputs. Troubleshooting issues in LLM applications is also time- and resource-intensive due to the black-box nature of their decision-making.&lt;/p&gt;

&lt;p&gt;Regular observability tools fall short for modern LLM apps and agents: multi-provider and agentic flows, hallucinations, compliance gaps, cost spikes, and shifting output quality all make &lt;a href="https://portkey.ai/blog/the-complete-guide-to-llm-observability/" rel="noopener noreferrer"&gt;&lt;u&gt;LLM observability&lt;/u&gt;&lt;/a&gt; a core business function of AI. Here are the six &lt;strong&gt;core pillars&lt;/strong&gt; your observability system must cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Handle &lt;strong&gt;runtime metrics&lt;/strong&gt; such as latency, success rates, error codes, and costs incurred by token usage.&lt;/li&gt;
&lt;li&gt;Ensure &lt;strong&gt;quality&lt;/strong&gt; by checking accuracy, detecting hallucinations, eval coverage, and grounding.&lt;/li&gt;
&lt;li&gt;Maintain ongoing &lt;strong&gt;governance,&lt;/strong&gt; including logs, access controls, guardrails, PII checks, and audit trails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track&lt;/strong&gt; how much the provider, model, user, and workspace have spent with built-in budgets and alerts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor&lt;/strong&gt; slowdowns, retries, failovers, and behavioral changes over time so failures can be diagnosed quickly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trace&lt;/strong&gt; agents, data sources, and user prompts, since these shape the model's behavior, contextual understanding, and ability to perform tasks.&lt;/li&gt;
&lt;/ol&gt;
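&lt;p&gt;To make the first pillar concrete, here is a minimal sketch of the per-request record such a system might capture; the field names and per-1K-token prices are illustrative assumptions, not any real provider’s rates:&lt;/p&gt;

```python
from dataclasses import dataclass

# Illustrative per-1K-token prices; real provider pricing differs.
PRICES = {"example-model": {"prompt": 0.005, "completion": 0.015}}

@dataclass
class LLMCallRecord:
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    status: str = "success"

    def cost_usd(self) -> float:
        # Attribute cost from token usage, pillar 1 and pillar 4 combined.
        price = PRICES[self.model]
        return ((self.prompt_tokens / 1000) * price["prompt"]
                + (self.completion_tokens / 1000) * price["completion"])

rec = LLMCallRecord("example-model", prompt_tokens=1200,
                    completion_tokens=300, latency_ms=850.0)
print(round(rec.cost_usd(), 4))  # 0.0105
```

A real system would emit these records to a log pipeline alongside error codes and success rates.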

&lt;p&gt;All of these factors make observability a core business function for AI stacks: it gives a holistic view of the system and lets teams infer its current state from the environment and the data it generates.&lt;/p&gt;

&lt;p&gt;Enterprises must decide whether to invest in building a custom observability stack or adopt a third-party tool that already ships these features. The sections below weigh both options against your time, budget, and feature needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The “Build” path: When teams try to create LLM observability in-house&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Building LLM observability in-house enables teams to tailor monitoring to their specific needs using tools like OpenTelemetry for traces and logs, as well as Grafana, ELK, or BI platforms for dashboards.&lt;/p&gt;
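&lt;p&gt;As a sketch of what the tracing layer does, the stand-in below mimics an OpenTelemetry-style span using only the standard library; a real in-house setup would use the opentelemetry SDK and export to a collector:&lt;/p&gt;

```python
import time
from contextlib import contextmanager

# Stdlib-only stand-in for an OpenTelemetry span; a real build would use
# the opentelemetry SDK and export spans to your backend.
@contextmanager
def llm_span(name, attributes, sink):
    start = time.perf_counter()
    span = {"name": name, "attributes": dict(attributes), "status": "OK"}
    try:
        yield span
    except Exception as exc:
        span["status"] = f"ERROR: {exc}"
        raise
    finally:
        # Record latency whether the call succeeded or failed.
        span["attributes"]["latency_ms"] = round(
            (time.perf_counter() - start) * 1000, 2)
        sink.append(span)

spans = []
with llm_span("llm.chat", {"model": "example-model", "prompt_tokens": 42},
              spans) as span:
    span["attributes"]["completion_tokens"] = 7  # filled in after the response
print(spans[0]["status"], spans[0]["attributes"]["latency_ms"])
```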

&lt;p&gt;Building your own LLM observability systems provides your team with strong control over your data and security, keeping all sensitive information within your infrastructure. This is especially important for industries with strict compliance requirements, such as healthcare, finance, or government, where maintaining privacy and meeting regulatory standards are crucial.&lt;/p&gt;

&lt;p&gt;This also gives your team a customized solution, enabling you to tailor features such as security rules, integrations, dashboards, and model tuning to the exact needs of your internal systems. This flexibility ensures the system grows and adapts as your organization evolves, enabling smoother workflows and a unified internal ecosystem.&lt;/p&gt;

&lt;p&gt;Additionally, an in-house solution lets you scale on your own timeline without waiting for vendor approvals or external resource allocation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The challenges of building LLM observability in-house&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Companies starting with only a few models may consider building an in-house monitoring tool as a cost-effective way to test out the value of their models.&lt;/p&gt;

&lt;p&gt;While this approach offers flexibility and full control, it demands significant development effort: designing logging and tracing schemas, maintaining custom SDKs, normalizing metrics across different models and providers, and building pipelines to track lineage, drift, and guardrails.&lt;/p&gt;
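&lt;p&gt;Normalizing usage metrics across providers, one of the chores just mentioned, might look like this; the response shapes are loosely modeled on OpenAI- and Anthropic-style usage fields and should be treated as illustrations, not exact API schemas:&lt;/p&gt;

```python
# Map each provider's usage payload into one common schema so dashboards
# and cost attribution can treat every model the same way.
def normalize_usage(provider, raw):
    if provider == "openai":
        u = raw["usage"]
        return {"prompt_tokens": u["prompt_tokens"],
                "completion_tokens": u["completion_tokens"]}
    if provider == "anthropic":
        u = raw["usage"]
        return {"prompt_tokens": u["input_tokens"],
                "completion_tokens": u["output_tokens"]}
    raise ValueError(f"unknown provider: {provider}")

a = normalize_usage("openai",
                    {"usage": {"prompt_tokens": 10, "completion_tokens": 5}})
b = normalize_usage("anthropic",
                    {"usage": {"input_tokens": 8, "output_tokens": 3}})
print(a, b)
```

Every new provider you adopt means another branch here, which is part of the ongoing maintenance cost.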

&lt;p&gt;Even then, dashboards need constant upkeep as the number and complexity of models grow, and an in-house tool’s limited coverage of LLM-specific components, such as prompt evals, hallucination tracking, and cost attribution, may no longer suffice.&lt;/p&gt;

&lt;p&gt;Most organizations spend 6–12 months building even a basic version, and ongoing updates are needed to keep pace with evolving models and workloads. Hallucination detection and agent tracing also demand enterprise-level AI expertise, making the effort even more demanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The “Buy” path: What off-the-shelf solutions offer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A ready-made LLM observability solution gives teams a turnkey way to monitor, analyze, and optimize AI models. By eliminating the need for in-house infrastructure, it enables faster deployment, seamless scalability, and comprehensive insight into model performance, reliability, and usage, letting teams focus on improving AI outcomes rather than managing tooling, with fewer resources and lower cost.&lt;/p&gt;

&lt;p&gt;Such platforms work across models and come with purpose-built telemetry for LLMs and agents. They offer ready-made dashboards for latency, cost, drift, and quality, along with built-in eval frameworks for hallucinations, grounding, and other LLM-specific behaviors.&lt;/p&gt;

&lt;p&gt;These platforms integrate seamlessly with LLM gateways, routers, and model providers, enabling smarter routing decisions based on real performance data such as costs and governance. This leads to more efficient usage, better quality outputs, and a reliable LLM stack.&lt;/p&gt;
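&lt;p&gt;As an illustration of the routing decisions described above, the toy router below picks the cheapest model whose recent latency and error-rate stats clear fixed thresholds; the model names and numbers are made up:&lt;/p&gt;

```python
# Toy performance-aware router: filter out unhealthy models, then pick
# the cheapest of the remaining candidates.
def pick_model(stats, max_p95_ms=2000, max_error_rate=0.02):
    healthy = [m for m, s in stats.items()
               if max_p95_ms >= s["p95_ms"]
               and max_error_rate >= s["error_rate"]]
    if not healthy:
        raise RuntimeError("no healthy model available")
    return min(healthy, key=lambda m: stats[m]["cost_per_1k"])

stats = {
    "fast-large":  {"p95_ms": 1800, "error_rate": 0.010, "cost_per_1k": 0.015},
    "fast-small":  {"p95_ms": 900,  "error_rate": 0.005, "cost_per_1k": 0.002},
    "flaky-cheap": {"p95_ms": 700,  "error_rate": 0.090, "cost_per_1k": 0.001},
}
print(pick_model(stats))  # prints "fast-small"
```

Real gateways would feed these stats from live telemetry rather than static numbers.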

&lt;p&gt;Such a platform can be set up in days instead of months and evolves through automatic updates as new models and capabilities ship. That means less ongoing maintenance, versioning, and compatibility work on your side, while you retain ownership of your data and configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What to consider when buying an LLM observability solution&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When buying an observability tool, it's important to evaluate how the solution fits your specific environment, whether you need open-source or commercial options, and your organization's future goals.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key dimensions for evaluation&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Metrics:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data types:&lt;/strong&gt; Does it support the essential pillars of observability (logs, metrics, and traces)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Querying &amp;amp; visualization:&lt;/strong&gt; How easy is it to query data, create meaningful dashboards, and set up alerts effectively?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrations:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem compatibility:&lt;/strong&gt; Does it offer out-of-the-box integrations for your existing infrastructure, languages, databases, and third-party services?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open standards:&lt;/strong&gt; Does it support open standards like OpenTelemetry for vendor-neutral data collection?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Can the system handle your current data volume and projected future growth without performance degradation?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data retention:&lt;/strong&gt; Does it provide flexible data retention policies to meet both compliance requirements and performance needs?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data quality &amp;amp; control:&lt;/strong&gt; Are there mechanisms for tagging, sampling, and controlling the quality and volume of data ingested to avoid "garbage in, garbage out"?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access control:&lt;/strong&gt; Does it offer robust role-based access control (RBAC) to manage who can view, edit, or delete data and configurations?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Costs:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pricing model:&lt;/strong&gt; Is the pricing predictable (e.g., based on data volume, hosts, or usage), and are there transparent cost-management tools?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total cost of ownership (TCO):&lt;/strong&gt; Beyond the sticker price, consider the operational costs of maintenance, training, and potential overages.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy and Security:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compliance:&lt;/strong&gt; Does the solution help you meet relevant industry and regulatory compliance standards (e.g., GDPR, HIPAA, SOC 2)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data handling:&lt;/strong&gt; How are sensitive data and PII handled, masked, or encrypted both in transit and at rest?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment Options:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility:&lt;/strong&gt; Does it offer options for SaaS, self-hosted, or hybrid deployment models to match your operational preferences and security requirements?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud compatibility:&lt;/strong&gt; If cloud-native, is it compatible with your specific cloud provider (e.g., AWS, Azure, GCP)?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
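&lt;p&gt;One way to turn the checklist above into a decision is a weighted scorecard; the weights and 1–5 scores below are placeholders to be replaced with your own assessment:&lt;/p&gt;

```python
# Weighted scorecard over the seven evaluation dimensions above.
# Weights sum to 1.0; scores are on a 1-5 scale.
WEIGHTS = {"metrics": 0.2, "integrations": 0.15, "scale": 0.15,
           "governance": 0.2, "costs": 0.15, "security": 0.1,
           "deployment": 0.05}

def weighted_score(scores):
    assert set(scores) == set(WEIGHTS), "score every dimension"
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

vendor_a = {"metrics": 5, "integrations": 4, "scale": 4, "governance": 5,
            "costs": 3, "security": 5, "deployment": 4}
print(round(weighted_score(vendor_a), 2))  # 4.35
```

Scoring two or three shortlisted vendors this way makes trade-offs explicit rather than anecdotal.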

&lt;p&gt;Platforms like Portkey are adaptable and can support your future technology stack. They also provide flexible deployments, open-standards support, and enterprise-grade security, making them a reliable choice for scaling AI applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build or buy an LLM observability platform&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Build if...&lt;/th&gt;
&lt;th&gt;Buy if...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Control&lt;/td&gt;
&lt;td&gt;You need full internal control or have strict privacy constraints&lt;/td&gt;
&lt;td&gt;You prefer managed, compliant oversight&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resources&lt;/td&gt;
&lt;td&gt;You have a strong platform/infrastructure team for handling AI operations&lt;/td&gt;
&lt;td&gt;You want a fast setup without engineering overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;You only need basic latency or cost metrics&lt;/td&gt;
&lt;td&gt;You need deep evaluations, lineage, and governance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scale&lt;/td&gt;
&lt;td&gt;Traffic is low and predictable&lt;/td&gt;
&lt;td&gt;You run high-volume, multi-model workflows and multiple team setups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Future readiness&lt;/td&gt;
&lt;td&gt;Your team can maintain and update observability and iterate quickly when needed&lt;/td&gt;
&lt;td&gt;You want ongoing updates, support for new models, and existing ecosystem support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Portkey is a strong option for enterprises looking for LLM observability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Recognized as a &lt;a href="https://portkey.ai/blog/portkey-in-2025-gartner-cool-vendors-in-llm-observability/" rel="noopener noreferrer"&gt;Cool Vendor in LLM Observability&lt;/a&gt;, Portkey provides observability built on top of the &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;AI gateway&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Portkey's Observability exists for teams that want enterprise-grade observability without taking on the engineering overhead of building and scaling it internally.&lt;/p&gt;

&lt;p&gt;It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unified tracing across models, providers, and agents&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Structured logs that capture prompts, completions, token counts, cost, latency, and guardrail outcomes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Built-in quality, safety, and reliability signals&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Team and department-level governance with budgets, rate limits, RBAC&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
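&lt;p&gt;To illustrate the budgets-and-alerts side of governance, here is a hypothetical workspace budget tracker; this is a sketch of the pattern, not Portkey’s actual API:&lt;/p&gt;

```python
# Hypothetical per-workspace budget tracker: alert near the limit,
# block once it is exceeded.
class WorkspaceBudget:
    def __init__(self, limit_usd, alert_at=0.8):
        self.limit_usd = limit_usd
        self.alert_at = alert_at      # alert threshold as a fraction of limit
        self.spent_usd = 0.0

    def record(self, cost_usd):
        self.spent_usd += cost_usd
        if self.spent_usd >= self.limit_usd:
            return "blocked"
        if self.spent_usd >= self.alert_at * self.limit_usd:
            return "alert"
        return "ok"

budget = WorkspaceBudget(limit_usd=100.0)
print(budget.record(50.0), budget.record(35.0), budget.record(20.0))
# prints "ok alert blocked"
```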

&lt;p&gt;Teams using Portkey don’t have to stitch together custom pipelines, dashboards, and schemas or maintain compatibility across changing model APIs. Instead, they start with a production-ready observability layer that evolves with the ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsghnkezhl4qv7tz1yub8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsghnkezhl4qv7tz1yub8.png" width="800" height="405"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Software Engineer in the Software Industry gives Portkey 4/5 Rating in Gartner Peer Insights™ Generative AI Engineering Market. Read the full review here: &lt;a href="https://gtnr.io/vbcumMl7N" rel="noopener noreferrer"&gt;https://gtnr.io/vbcumMl7N&lt;/a&gt; #gartnerpeerinsights&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you’re exploring LLM observability, Portkey can help you get there faster, with less operational overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to see how this works in practice?&lt;/strong&gt; &lt;a href="https://portkey.sh/blog?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;strong&gt;Book a demo&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;with our team.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>tooling</category>
      <category>llm</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
