<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Frank Fiegel</title>
    <description>The latest articles on Forem by Frank Fiegel (@punkpeye).</description>
    <link>https://forem.com/punkpeye</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F883707%2Fdaaf6c2e-12aa-402f-bf6d-24ebcd38db1f.jpg</url>
      <title>Forem: Frank Fiegel</title>
      <link>https://forem.com/punkpeye</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/punkpeye"/>
    <language>en</language>
    <item>
      <title>The Hackers Who Tracked My Sleep Cycle</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Thu, 26 Mar 2026 16:17:47 +0000</pubDate>
      <link>https://forem.com/punkpeye/the-hackers-who-tracked-my-sleep-cycle-21a</link>
      <guid>https://forem.com/punkpeye/the-hackers-who-tracked-my-sleep-cycle-21a</guid>
      <description>&lt;p&gt;This is a short story about how I caught hackers timing their attacks around my daily routine.&lt;/p&gt;

&lt;p&gt;A couple of weeks ago, in the middle of the night, I got alerts on my phone that various metrics were going haywire.&lt;/p&gt;

&lt;p&gt;The only thing I noticed was a huge spike in sign-ups. This coincided with one of my articles being on the front page of Hacker News, so I didn't think much of it.&lt;/p&gt;

&lt;p&gt;The spike lasted a few hours and flattened out just before I started troubleshooting, so I went back to sleep.&lt;/p&gt;

&lt;p&gt;When I woke up the next morning, a few thousand more accounts had appeared.&lt;/p&gt;

&lt;p&gt;I thought this was a bit suspicious, but ... even if someone was being naughty, they seemed to have given up.&lt;/p&gt;

&lt;p&gt;Nothing else was out of the ordinary that day. I even decided to stay up later that night to see if the pattern would repeat. Nothing.&lt;/p&gt;

&lt;p&gt;As a small precaution, I activated CAPTCHA to slow down the sign-ups and went back to sleep.&lt;/p&gt;

&lt;p&gt;The next morning, you guessed it... same pattern.&lt;/p&gt;

&lt;p&gt;This time, I decided to do a deep dive into the data.&lt;/p&gt;

&lt;p&gt;What I found was that the hackers were...&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;creating thousands of accounts&lt;/li&gt;
&lt;li&gt;adding a valid payment method to each account&lt;/li&gt;
&lt;li&gt;running a single very expensive LLM call (2-3 USD)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This let the first request go through, then triggered a charge to their payment method. The charge would be rejected, but by then the request had already been processed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The reason the first request goes through is that we deposit a little bit of money into the account when it's created. A nominal amount that's enough to play with the API. However, if a payment method is added, we allow people to go into overdraft, which is why their expensive LLM call goes through.&lt;/p&gt;
&lt;/blockquote&gt;
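
&lt;p&gt;To make the trade-off concrete, here is a minimal sketch of the kind of check described above (names and amounts are hypothetical, not our production code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Hypothetical sketch of the overdraft rule described above.
type Account = { balance: number; hasPaymentMethod: boolean };

function canRunRequest(account: Account, estimatedCost: number): boolean {
  // New accounts receive a small free balance to play with the API.
  if (account.balance &gt;= estimatedCost) {
    return true;
  }
  // With a payment method on file, the request is allowed to overdraft
  // and the card is charged afterwards -- the step that was abused.
  return account.hasPaymentMethod;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;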

&lt;p&gt;Anyway, using this method, they would get away with about a thousand dollars' worth of credits every night, which kept them interested in the service.&lt;/p&gt;

&lt;p&gt;But what caught my attention wasn't the money – it was the timing. The attacks coincided with my sleep cycle.&lt;/p&gt;

&lt;p&gt;At this point, I still thought of it as unlucky timing, but even then something just didn't sit right with me.&lt;/p&gt;

&lt;p&gt;Coincidentally, that day I decided to take a break and disconnect from my computer early. And lo and behold, just 30 minutes after I shut down my computer, I got the first notification.&lt;/p&gt;

&lt;p&gt;I logged in to check, and it stopped.&lt;/p&gt;

&lt;p&gt;Went to play some games and ... 30 minutes later, I got the second notification.&lt;/p&gt;

&lt;p&gt;That's when it clicked – the timing of the attacks wasn't random. They were checking my Discord status to see if I was online.&lt;/p&gt;

&lt;p&gt;Sure enough, I confirmed this by setting myself as offline on Discord, and the attacks popped right back up.&lt;/p&gt;

&lt;p&gt;Over the next few days, I used this insight to mess with the hackers and to use them as my personal pen testers.&lt;/p&gt;

&lt;p&gt;I didn't want to remove free credits for everyone, so I began experimenting with different ways to deter future attackers.&lt;/p&gt;

&lt;p&gt;I would make a change, then go 'offline'. I'd watch them troubleshoot their automations until they figured out a workaround. Then I'd go back 'online'. The attacks would resume, and the cycle would repeat.&lt;/p&gt;

&lt;p&gt;I mostly forgot about the entire incident until they came back to try their luck again the following week. But despite a few nights of alerts about tripwires getting triggered, they never managed to get more than the few cents we deposited into new accounts – not enough of an incentive to keep trying.&lt;/p&gt;

&lt;p&gt;In the end, it was the cat-and-mouse game that made the whole experience worth it. I got free pen testing; they got a few dollars.&lt;/p&gt;

&lt;h2&gt;
  
  
  Card testing vulnerability
&lt;/h2&gt;

&lt;p&gt;I wasn't surprised about the overdraft feature being abused. This was something we were aware of and treated as a conscious trade-off between convenience and risk of abuse.&lt;/p&gt;

&lt;p&gt;The bigger issue was that this made me realize that a malicious actor could abuse our system for &lt;a href="https://docs.stripe.com/disputes/prevention/card-testing" rel="noopener noreferrer"&gt;card testing&lt;/a&gt;. That's a widespread problem and one that will get your Stripe account flagged. When researching this problem, I didn't find many effective solutions, so I wanted to dedicate part of this blog post to sharing what I learned.&lt;/p&gt;

&lt;p&gt;Here's what I tried and how it held up:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Effectiveness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Device fingerprinting&lt;/td&gt;
&lt;td&gt;Ineffective. Fingerprints are great for detecting legitimate returning users (e.g. to bypass CAPTCHA), but because they are easy to fake, they are not effective at detecting malicious actors.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IP address blocking&lt;/td&gt;
&lt;td&gt;Ineffective. Residential proxies are cheap and easy to get.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CAPTCHA&lt;/td&gt;
&lt;td&gt;Mild deterrent, but ultimately ineffective; many off-the-shelf services exist to bypass CAPTCHA.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OTP&lt;/td&gt;
&lt;td&gt;Mild deterrent, but ultimately ineffective; many off-the-shelf services exist to bypass OTP.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JA4&lt;/td&gt;
&lt;td&gt;Somewhat effective. &lt;a href="https://blog.cloudflare.com/ja4-signals/" rel="noopener noreferrer"&gt;JA4&lt;/a&gt; is a TLS fingerprinting method that identifies clients based on how they negotiate TLS connections. Of all data points that we collect, JA4 is the most stable identifier.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALTCHA&lt;/td&gt;
&lt;td&gt;Somewhat effective. &lt;a href="https://altcha.org/" rel="noopener noreferrer"&gt;ALTCHA&lt;/a&gt; is a proof-of-work challenge that requires the client to solve a computational puzzle before submitting a request. When combined with prior methods, can slow down the attacks enough to deter the attacker.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limiting&lt;/td&gt;
&lt;td&gt;Somewhat effective. Slows down the attacks, but may hurt legitimate users.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At the end of the day, each method is individually bypassable – the game is making the combination expensive enough that the attacker moves on.&lt;/p&gt;
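
&lt;p&gt;To illustrate, a layered defense might combine these weak signals into a single risk score instead of trusting any one of them (the weights and thresholds below are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustrative risk scoring over the signals from the table above.
type SignupSignals = {
  ja4OnBlocklist: boolean;       // TLS fingerprint seen in prior abuse
  altchaSolved: boolean;         // completed the proof-of-work challenge
  signupsFromIpLastHour: number;
};

function riskScore(s: SignupSignals): number {
  let score = 0;
  if (s.ja4OnBlocklist) score += 2;
  if (!s.altchaSolved) score += 1;
  if (s.signupsFromIpLastHour &gt; 3) score += 1; // bursty sign-ups from one IP
  return score;
}
// e.g. hold sign-ups with a score of 2 or more for extra verification
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;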

&lt;p&gt;Oh, and set your Discord status to offline.&lt;/p&gt;

</description>
      <category>security</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>MCP Inspector is Now Stable: A Browser-Based Tool for Testing MCP Servers</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Sat, 17 Jan 2026 22:34:20 +0000</pubDate>
      <link>https://forem.com/punkpeye/mcp-inspector-is-now-stable-a-browser-based-tool-for-testing-mcp-servers-4dim</link>
      <guid>https://forem.com/punkpeye/mcp-inspector-is-now-stable-a-browser-based-tool-for-testing-mcp-servers-4dim</guid>
      <description>&lt;p&gt;&lt;strong&gt;For the MCP ecosystem to grow, we need better developer tools. Today, we're publicly launching MCP Inspector.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Model Context Protocol (MCP) is rapidly becoming the standard for connecting AI assistants to external tools and data sources. But as developers build MCP servers, they face a common challenge: how do you test and debug them effectively?&lt;/p&gt;

&lt;p&gt;Until now, testing meant setting up local environments, managing dependencies, or logging into platforms that collect your data. We built MCP Inspector to change that.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is MCP Inspector?
&lt;/h2&gt;

&lt;p&gt;MCP Inspector is a free, browser-based tool that lets you connect to any MCP server URL and interact with its full capabilities—tools, resources, prompts, and tasks—directly from your browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftnw3yppg0vobsxzroe3m.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftnw3yppg0vobsxzroe3m.jpeg" alt=" " width="800" height="591"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it now: &lt;a href="https://glama.ai/mcp/inspector" rel="noopener noreferrer"&gt;glama.ai/mcp/inspector&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why We Built This
&lt;/h2&gt;

&lt;p&gt;When we started building MCP integrations, we found ourselves constantly switching between terminals, debugging configurations, and writing throwaway scripts just to test a single tool call. The official MCP Inspector is great for local development, but we needed something we could use anywhere—to test remote servers, share debugging sessions with teammates, or quickly verify a deployment.&lt;/p&gt;

&lt;p&gt;So we built what we needed: a zero-friction inspector that runs entirely in your browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  No Login Required
&lt;/h3&gt;

&lt;p&gt;Open the URL, paste your server address, and start inspecting. No account creation, no signup flow, no barriers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Privacy-First Design
&lt;/h3&gt;

&lt;p&gt;All requests go directly from your browser to the MCP server. We don't proxy, log, or store any of your requests or responses. Your API keys and server interactions stay between you and your server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Full Protocol Support
&lt;/h3&gt;

&lt;p&gt;We didn't cut corners. MCP Inspector supports the complete MCP specification:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; — List available tools, configure parameters with a dynamic form, and execute them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources &amp;amp; Templates&lt;/strong&gt; — Browse and read server resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt; — Test prompt templates with arguments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt; — Create, monitor, cancel, and retrieve results from long-running tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progress Notifications&lt;/strong&gt; — See real-time progress updates during execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elicitations&lt;/strong&gt; — Respond to server-initiated form requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OAuth 2.0&lt;/strong&gt; — Full OAuth flow support with dynamic client registration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bearer Tokens &amp;amp; Custom Headers&lt;/strong&gt; — Flexible authentication options&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Built-in Demo Server
&lt;/h3&gt;

&lt;p&gt;Not sure how it works? We've included a test server (&lt;code&gt;mcp-test.glama.ai/mcp&lt;/code&gt;) that demonstrates every feature—tasks, elicitations, progress notifications, audio responses, and images. It's the fastest way to understand what MCP can do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shareable Sessions
&lt;/h3&gt;

&lt;p&gt;Your entire configuration—servers, selected tools, and arguments—is stored in the URL. Bookmark it to save your setup, or share the link with a colleague to give them the exact same view.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="https://glama.ai/mcp/inspector" rel="noopener noreferrer"&gt;glama.ai/mcp/inspector&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Click "Add Server" and enter your MCP server URL&lt;/li&gt;
&lt;li&gt;Select your authentication method (None, OAuth, Bearer Token, or Custom Headers)&lt;/li&gt;
&lt;li&gt;Connect and start exploring&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The interface shows your server's capabilities in tabs: Tools, Resources, Resource Templates, Prompts, and Tasks. Select any item to see its details, configure parameters, and execute requests. All requests and responses are logged in the panel below, and additional debugging info is available in your browser's developer console.&lt;/p&gt;

&lt;h2&gt;
  
  
  For Local Development
&lt;/h2&gt;

&lt;p&gt;MCP Inspector works great for testing remote servers. For local development with &lt;code&gt;stdio&lt;/code&gt; transports or other local-only features, we recommend the &lt;a href="https://modelcontextprotocol.io/docs/tools/inspector" rel="noopener noreferrer"&gt;official MCP Inspector&lt;/a&gt; which is designed specifically for that workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Today
&lt;/h2&gt;

&lt;p&gt;We've been using MCP Inspector internally for months, and it's become an essential part of our development workflow. We're excited to share it with the community.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://glama.ai/mcp/inspector" rel="noopener noreferrer"&gt;Open MCP Inspector →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Have feedback or feature requests? &lt;a href="https://glama.ai/mcp/discord" rel="noopener noreferrer"&gt;Join our Discord&lt;/a&gt; and let us know what you'd like to see.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The MCP Inspector is part of Glama's suite of tools for the Model Context Protocol ecosystem. Explore more at &lt;a href="https://glama.ai/mcp" rel="noopener noreferrer"&gt;glama.ai/mcp&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>devtools</category>
      <category>ai</category>
    </item>
    <item>
      <title>MCP vs API</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Sun, 08 Jun 2025 15:39:11 +0000</pubDate>
      <link>https://forem.com/punkpeye/mcp-vs-api-abb</link>
      <guid>https://forem.com/punkpeye/mcp-vs-api-abb</guid>
      <description>&lt;p&gt;Every week a new thread emerges on Reddit asking about the difference between MCP and API. I've tried summarizing everything that's been said about MCP vs API in a single post (and a single table).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Traditional APIs (REST/GraphQL)&lt;/th&gt;
&lt;th&gt;Model Context Protocol (MCP)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Interface styles (REST, GraphQL) with optional spec formats (OpenAPI, GraphQL SDL)&lt;/td&gt;
&lt;td&gt;Standardized protocol with enforced message structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Designed for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human developers writing code&lt;/td&gt;
&lt;td&gt;AI agents making decisions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data location&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST: Path, headers, query params, body (multiple formats)&lt;/td&gt;
&lt;td&gt;Single JSON input/output per tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Discovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Static docs, regenerate SDKs for changes&lt;sup id="fnref1"&gt;1&lt;/sup&gt; &lt;sup id="fnref2"&gt;2&lt;/sup&gt;
&lt;/td&gt;
&lt;td&gt;Runtime introspection (&lt;code&gt;tools/list&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM generates HTTP requests (error-prone)&lt;/td&gt;
&lt;td&gt;LLM picks tool, deterministic code runs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Direction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Typically client-initiated; server-push exists but not standardized&lt;/td&gt;
&lt;td&gt;Bidirectional as first-class feature&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Local access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires port, auth, CORS setup&lt;/td&gt;
&lt;td&gt;Native stdio support for desktop tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Training target&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Impractical at scale due to heterogeneity&lt;/td&gt;
&lt;td&gt;Single protocol enables model fine-tuning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The HTTP API Problem
&lt;/h2&gt;

&lt;p&gt;HTTP APIs suffer from combinatorial chaos. To send data to an endpoint, you might encode it in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;URL path (&lt;code&gt;/users/123&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Request headers (&lt;code&gt;X-User-Id: 123&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Query parameters (&lt;code&gt;?userId=123&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Request body (JSON, XML, form-encoded, CSV)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenAPI/Swagger documents these variations, but as a specification format, it describes existing patterns rather than enforcing consistency. Building automated tools to reliably use arbitrary APIs remains hard because HTTP wasn't designed for this—it was the only cross-platform, firewall-friendly transport universally available from browsers.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP: A Wire Protocol, Not Documentation
&lt;/h2&gt;

&lt;p&gt;Model Context Protocol (MCP) isn't another API standard—it's a wire protocol that enforces consistency. While OpenAPI documents existing interfaces with their variations, MCP mandates specific patterns: JSON-RPC 2.0 transport, single input schema per tool, deterministic execution.&lt;/p&gt;

&lt;p&gt;Key architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transport&lt;/strong&gt;: stdio (local) or &lt;a href="https://modelcontextprotocol.io/specification/2025-03-26/basic/transports#streamable-http" rel="noopener noreferrer"&gt;streamable HTTP&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discovery&lt;/strong&gt;: &lt;code&gt;tools/list&lt;/code&gt;, &lt;code&gt;resources/list&lt;/code&gt; expose capabilities at runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primitives&lt;/strong&gt;: Tools (actions), Resources (read-only data), Prompts (templates)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Not Just Use OpenAPI?
&lt;/h2&gt;

&lt;p&gt;The most common question: "Why not extend OpenAPI with AI-specific features?"&lt;/p&gt;

&lt;p&gt;Three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;OpenAPI describes; MCP prescribes&lt;/strong&gt;. You can't fix inconsistency by documenting it better—you need enforcement at the protocol level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrofitting fails at scale&lt;/strong&gt;. OpenAPI would need to standardize transport, mandate single-location inputs, require specific schemas, add bidirectional primitives—essentially becoming a different protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The ecosystem problem&lt;/strong&gt;. Even if OpenAPI added these features tomorrow, millions of existing APIs wouldn't adopt them. MCP starts fresh with AI-first principles.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Five Fundamental Differences
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Runtime Discovery vs Static Specs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API&lt;/strong&gt;: Ship new client code when endpoints change&lt;br&gt;&lt;br&gt;
&lt;strong&gt;MCP&lt;/strong&gt;: Agents query capabilities dynamically and adapt automatically&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// MCP discovery - works with any server
client.request('tools/list')
// Returns all available tools with schemas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Deterministic Execution vs LLM-Generated Calls
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API&lt;/strong&gt;: LLM writes the HTTP request → hallucinated paths, wrong parameters&lt;br&gt;&lt;br&gt;
&lt;strong&gt;MCP&lt;/strong&gt;: LLM picks which tool → wrapped code executes deterministically&lt;/p&gt;

&lt;p&gt;This distinction is critical for production safety. With MCP, you can test, sanitize inputs, and handle errors in actual code, not hope the LLM formats requests correctly.&lt;/p&gt;
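
&lt;p&gt;A sketch of what that wrapped code can look like (the tool name and handler shape are illustrative, not any specific SDK's API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// The LLM only chooses the tool and supplies arguments;
// everything below runs as ordinary, testable code.
async function handleToolCall(name: string, args: Record&lt;string, unknown&gt;) {
  if (name !== "github.search_prs") {
    throw new Error("Unknown tool: " + name);
  }
  const query = String(args.query ?? "");
  if (query.length === 0) {
    // Input validation happens in code, not in the model's output.
    throw new Error("query is required");
  }
  // ...call the underlying API with sanitized, typed inputs...
  return { query };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;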
&lt;h3&gt;
  
  
  3. Bidirectional Communication
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API&lt;/strong&gt;: Server-push exists (WebSockets, SSE, GraphQL subscriptions) but lacks standardization&lt;br&gt;&lt;br&gt;
&lt;strong&gt;MCP&lt;/strong&gt;: Bidirectional communication as first-class feature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request LLM completions from server&lt;/li&gt;
&lt;li&gt;Ask users for input (&lt;a href="https://modelcontextprotocol.io/specification/draft/client/elicitation" rel="noopener noreferrer"&gt;elicitation&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Push progress notifications&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  4. Single-Request Human Tasks
&lt;/h3&gt;

&lt;p&gt;REST APIs fragment human tasks across endpoints. Creating a calendar event might require:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;POST /events&lt;/code&gt; (create)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /conflicts&lt;/code&gt; (check)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /invitations&lt;/code&gt; (notify)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;MCP tools map to complete workflows. One tool, one human task.&lt;/p&gt;
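
&lt;p&gt;Sketched as a single tool body (the endpoints and the &lt;code&gt;http&lt;/code&gt; helper are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Http = {
  post(path: string, body: unknown): Promise&lt;any&gt;;
  get(path: string): Promise&lt;any&gt;;
};

// One MCP tool wrapping the three REST steps above.
async function scheduleEvent(http: Http, args: { title: string; attendees: string[] }) {
  const event = await http.post("/events", { title: args.title });      // 1. create
  const conflicts = await http.get("/conflicts?eventId=" + event.id);   // 2. check
  if (conflicts.length === 0) {
    await http.post("/invitations", { eventId: event.id, to: args.attendees }); // 3. notify
  }
  return { event, conflicts };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;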
&lt;h3&gt;
  
  
  5. Local-First by Design
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API&lt;/strong&gt;: Requires HTTP server (port binding, CORS, auth headers)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;MCP&lt;/strong&gt;: Can run as local process via stdio—no network layer needed&lt;/p&gt;

&lt;p&gt;Why this matters: When MCP servers run locally via stdio, they inherit the host process's permissions.&lt;/p&gt;

&lt;p&gt;This enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct filesystem access (read/write files)&lt;/li&gt;
&lt;li&gt;Terminal command execution&lt;/li&gt;
&lt;li&gt;System-level operations&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;[!NOTE]&lt;br&gt;
A local HTTP server could provide the same capabilities. However, I think the fact that MCP led with &lt;code&gt;stdio&lt;/code&gt; transport planted the idea that MCP servers are meant to run as local services, which is not how we typically think of APIs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  The Training Advantage
&lt;/h2&gt;

&lt;p&gt;MCP's standardization creates a future opportunity: models could be trained on a single, consistent protocol rather than thousands of API variations. While models today use MCP through existing function-calling capabilities, the protocol's uniformity offers immediate practical benefits:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistent patterns across all servers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discovery: &lt;code&gt;tools/list&lt;/code&gt;, &lt;code&gt;resources/list&lt;/code&gt;, &lt;code&gt;prompts/list&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Execution: &lt;code&gt;tools/call&lt;/code&gt; with single JSON argument object&lt;/li&gt;
&lt;li&gt;Errors: Standard JSON-RPC format with numeric codes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reduced cognitive load for models:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Every&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;MCP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;follows&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;same&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;pattern:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tools/call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"github.search_prs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"security"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"open"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Versus&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;REST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;APIs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;endless&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;variations:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/api/v&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;/search?q=security&amp;amp;type=pr&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/graphql&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{ search(query: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;security&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;) { ... } }"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/repos/owner/repo/pulls?state=open&amp;amp;search=security&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This standardization means models need to learn one calling convention instead of inferring patterns from documentation. As MCP adoption grows, future models could be specifically optimized for the protocol, similar to how models today are trained on function-calling formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  They're Layers, Not Competitors
&lt;/h2&gt;

&lt;p&gt;Most MCP servers wrap existing APIs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[AI Agent] ⟷ MCP Client ⟷ MCP Server ⟷ REST API ⟷ Service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;mcp-github&lt;/code&gt; server translates &lt;code&gt;repository/list&lt;/code&gt; into GitHub REST calls. You keep battle-tested infrastructure while adding AI-friendly ergonomics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Example
&lt;/h2&gt;

&lt;p&gt;Consider a task: "Find all pull requests mentioning security issues and create a summary report."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With OpenAPI/REST&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM reads API docs, generates: &lt;code&gt;GET /repos/{owner}/{repo}/pulls?state=all&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Hopes it formatted the request correctly&lt;/li&gt;
&lt;li&gt;Parses response, generates: &lt;code&gt;GET /repos/{owner}/{repo}/pulls/{number}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Repeats for each PR (rate limiting issues)&lt;/li&gt;
&lt;li&gt;Generates search queries for comments&lt;/li&gt;
&lt;li&gt;Assembles report&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;With MCP&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM calls: &lt;code&gt;github.search_issues_and_prs({query: "security", type: "pr"})&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Deterministic code handles pagination, rate limits, error retry&lt;/li&gt;
&lt;li&gt;Returns structured data&lt;/li&gt;
&lt;li&gt;LLM focuses on analysis, not API mechanics&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;HTTP APIs evolved to serve human developers and browser-based applications, not AI agents. MCP addresses AI-specific requirements from the ground up: runtime discovery, deterministic execution, and bidirectional communication.&lt;/p&gt;

&lt;p&gt;For AI-first applications, MCP provides structural advantages—local execution, server-initiated flows, and guaranteed tool reliability—that would require significant workarounds in traditional API architectures. The practical path forward involves using both: maintaining APIs for human developers while adding MCP for AI agent integration.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;GraphQL offers schema introspection, but it lacks task-level descriptions or JSON-schema-style validation, so SDKs still regenerate for new fields. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;OpenAPI 3.1+ supports runtime discovery through the OpenAPI document endpoint. The key difference is that MCP mandates runtime discovery while OpenAPI makes it optional. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>api</category>
    </item>
    <item>
      <title>NLWeb: Microsoft's Protocol for AI-Powered Website Search</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Wed, 04 Jun 2025 17:57:44 +0000</pubDate>
      <link>https://forem.com/punkpeye/nlweb-microsofts-protocol-for-ai-powered-website-search-ng8</link>
      <guid>https://forem.com/punkpeye/nlweb-microsofts-protocol-for-ai-powered-website-search-ng8</guid>
      <description>&lt;p&gt;Microsoft recently open-sourced &lt;a href="https://github.com/microsoft/NLWeb" rel="noopener noreferrer"&gt;NLWeb&lt;/a&gt;, a protocol for adding conversational interfaces to websites.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; It leverages &lt;a href="https://schema.org/" rel="noopener noreferrer"&gt;Schema.org&lt;/a&gt; structured data that many sites already have and includes built-in support for MCP (Model Context Protocol), enabling both human conversations and agent-to-agent communication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key idea:&lt;/strong&gt; NLWeb creates a standard protocol that turns any website into a conversational interface that both humans and AI agents can query naturally.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Problem Does NLWeb Solve?
&lt;/h2&gt;

&lt;p&gt;Currently, websites have structured data (Schema.org) but no standard way for AI agents or conversational interfaces to access it. Every implementation is bespoke. Traditional search interfaces struggle with context-aware, multi-turn queries.&lt;/p&gt;

&lt;p&gt;NLWeb creates a standard protocol for conversational access to web content. What RSS did for syndication, NLWeb aims to do for AI interactions: one implementation serves both human chat interfaces and programmatic agent access.&lt;/p&gt;

&lt;p&gt;The key insight: Instead of building custom NLP for every site, NLWeb leverages LLMs' existing understanding of Schema.org to create instant conversational interfaces.&lt;/p&gt;

&lt;p&gt;The real power comes from multi-turn conversations that preserve context:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"Find recipes for dinner parties"&lt;/li&gt;
&lt;li&gt;"Only vegetarian options"
&lt;/li&gt;
&lt;li&gt;"That can be prepared in under an hour"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each query builds on the previous context - something traditional search interfaces struggle with.&lt;/p&gt;

&lt;h2&gt;
  
  
  How NLWeb Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two-Component System
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Protocol Layer&lt;/strong&gt;: REST API (&lt;code&gt;/ask&lt;/code&gt; endpoint) and MCP server (&lt;code&gt;/mcp&lt;/code&gt; endpoint) that accept natural language queries and return Schema.org JSON responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation Layer&lt;/strong&gt;: Reference implementation that orchestrates multiple LLM calls for query processing&lt;/li&gt;
&lt;/ol&gt;
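&lt;p&gt;To make the protocol layer concrete, here is a minimal Python sketch of what an &lt;code&gt;/ask&lt;/code&gt; handler does: accept a natural-language query and return Schema.org-shaped JSON. This is a toy, not NLWeb's reference code; the &lt;code&gt;retrieve&lt;/code&gt; callback is a hypothetical stand-in for the implementation layer (vector retrieval plus LLM ranking):&lt;/p&gt;

```python
# Minimal sketch of the protocol layer: an /ask handler that accepts a natural
# language query and returns Schema.org-shaped results. Retrieval and ranking
# are stubbed; a real implementation delegates to the implementation layer.
import json, uuid

def handle_ask(params, retrieve):
    query = params["query"]
    results = retrieve(query)  # stand-in for vector retrieval + LLM ranking
    return {
        "query_id": params.get("query_id", uuid.uuid4().hex),
        "results": [
            {
                "url": r["url"],
                "name": r["name"],
                "score": r["score"],
                "schema_object": {"@type": r["type"]},
            }
            for r in results
        ],
    }

# Toy retriever returning one canned podcast episode.
response = handle_ask(
    {"query": "find podcasts about AI"},
    lambda q: [{"url": "https://example.com/ep1", "name": "AI Safety",
                "score": 85, "type": "PodcastEpisode"}],
)
print(json.dumps(response["results"][0]["schema_object"]))
```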

&lt;h3&gt;
  
  
  Query Processing Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query → Parallel Pre-processing → Vector Retrieval → LLM Ranking → Response
             ├─ Relevancy Check
             ├─ Decontextualization  
             ├─ Memory Detection
             └─ Fast Track Path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this flow, a single query may trigger 50+ targeted LLM calls for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query decontextualization based on conversation history&lt;/li&gt;
&lt;li&gt;Relevancy scoring against site content&lt;/li&gt;
&lt;li&gt;Result ranking with custom prompts per content type&lt;/li&gt;
&lt;li&gt;Optional post-processing (summarization/generation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The "fast track" optimization launches a parallel path to retrieval (step 3) while pre-processing occurs, but results are blocked until relevancy checks complete&lt;sup id="fnref2"&gt;2&lt;/sup&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why 50+ LLM Calls?
&lt;/h3&gt;

&lt;p&gt;Instead of using one large prompt to handle everything, NLWeb breaks each query into dozens of small, specific questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Is this query about recipes?"&lt;/li&gt;
&lt;li&gt;"Does it reference something mentioned earlier?"&lt;/li&gt;
&lt;li&gt;"Is the user asking to remember dietary preferences?"&lt;/li&gt;
&lt;li&gt;"How relevant is this specific result?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach has two major benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No hallucination&lt;/strong&gt; - Results only come from your actual database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better accuracy&lt;/strong&gt; - Each LLM call has one clear job it can do well&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Think of it like having a team of specialists instead of one generalist.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Even if you don't use NLWeb, this pattern—using many focused LLM calls instead of one complex prompt—is worth borrowing.&lt;/p&gt;
&lt;/blockquote&gt;
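&lt;p&gt;In code, the pattern looks like a set of narrow classifiers rather than one mega-prompt. In this sketch, &lt;code&gt;ask_llm&lt;/code&gt; is a hypothetical yes/no model call, faked with keyword rules so it runs offline:&lt;/p&gt;

```python
# Sketch of the many-focused-calls pattern: each classifier gets one narrow
# question. `ask_llm` stands in for a yes/no LLM call.
def ask_llm(question, text):
    rules = {
        "about_recipes": ("recipe", "vegetarian", "dinner"),
        "references_context": ("that", "those", "the first one"),
        "memory_request": ("remember", "always", "never"),
    }
    return any(k in text.lower() for k in rules[question])

def analyze(query):
    # One small, independent call per question instead of one giant prompt.
    return {q: ask_llm(q, query) for q in
            ("about_recipes", "references_context", "memory_request")}

print(analyze("That can be prepared in under an hour"))
# {'about_recipes': False, 'references_context': True, 'memory_request': False}
```

&lt;p&gt;Each answer is independently verifiable, which is exactly what makes the approach resistant to hallucination.&lt;/p&gt;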

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;The best way to wrap your head around NLWeb is to try it out.&lt;/p&gt;

&lt;p&gt;Microsoft provides a &lt;a href="https://github.com/microsoft/NLWeb/blob/main/docs/nlweb-hello-world.md" rel="noopener noreferrer"&gt;quick start guide&lt;/a&gt; for setting up an example NLWeb server with the &lt;a href="https://www.microsoft.com/en-us/behind-the-tech" rel="noopener noreferrer"&gt;Behind The Tech&lt;/a&gt; RSS feed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Setup&lt;/span&gt;
git clone https://github.com/microsoft/NLWeb
&lt;span class="nb"&gt;cd &lt;/span&gt;NLWeb
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv myenv
&lt;span class="nb"&gt;source &lt;/span&gt;myenv/bin/activate
&lt;span class="nb"&gt;cd &lt;/span&gt;code
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Configure (copy .env.template → .env, update API keys)&lt;/span&gt;

&lt;span class="c"&gt;# Load data&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; tools.db_load https://feeds.libsyn.com/121695/rss Behind-the-Tech

&lt;span class="c"&gt;# Run&lt;/span&gt;
python app-file.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Go to &lt;a href="http://localhost:8000/" rel="noopener noreferrer"&gt;localhost:8000&lt;/a&gt; and you should have a working NLWeb server.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I have also noticed that the repository contains a &lt;a href="https://github.com/microsoft/NLWeb/blob/main/docs/nlweb-cli.md" rel="noopener noreferrer"&gt;CLI&lt;/a&gt; to simplify configuration, testing, and execution of the application. However, I struggled to get it working.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once you have the server running, you can ask it questions like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{
    "query": "tell me more about the first one",
    "prev": "find podcasts about AI,what topics do they cover"
  }'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which will return a JSON response like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AI Safety with Stuart Russell"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Discussion on alignment challenges..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schema_object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"@type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PodcastEpisode"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Glama NLWeb Server
&lt;/h3&gt;

&lt;p&gt;As part of writing this post, I've built a simple NLWeb server using Node.js. You can use it to query our &lt;a href="https://glama.ai/mcp/servers" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://glama.ai/nlweb/ask &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"query": "MCP servers for working with GitHub"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;As far as I can tell, this is the first ever public NLWeb endpoint!&lt;/p&gt;

&lt;p&gt;Due to the volume of LLM calls, it takes a few seconds to respond.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;or, if you want to continue the conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://glama.ai/nlweb/ask &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "query": "servers that can create PRs",
    "prev": "MCP servers for working with GitHub"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or, if you want to summarize the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://glama.ai/nlweb/ask &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "query": "MCP servers for working with GitHub",
    "mode": "summarize"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Useful when you want an overview rather than just a list of results.&lt;/p&gt;

&lt;p&gt;or, if you want to generate a response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://glama.ai/nlweb/ask &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "query": "MCP servers for working with GitHub",
    "mode": "generate"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mode attempts to answer the question using the retrieved results (like traditional RAG).&lt;/p&gt;

&lt;p&gt;Things that made it easy to implement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We have existing embeddings for every MCP server and a vector store&lt;/li&gt;
&lt;li&gt;We already have a way to make LLM calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few questions came to mind as I was implementing this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It seems that NLWeb doesn't dictate where the &lt;code&gt;/ask&lt;/code&gt; endpoint needs to be hosted—does it have to be &lt;code&gt;https://glama.ai/ask&lt;/code&gt; or can it be &lt;code&gt;https://glama.ai/nlweb/ask&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;It wasn't super clear to me which Schema.org data is best suited to describe MCP servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not surprisingly, the slowest part of the pipeline is the LLM calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  REST API
&lt;/h2&gt;

&lt;p&gt;Currently, NLWeb supports two APIs at the endpoints &lt;code&gt;/ask&lt;/code&gt; and &lt;code&gt;/mcp&lt;/code&gt;. The arguments are the same for both, as is most of the functionality. The &lt;code&gt;/mcp&lt;/code&gt; endpoint returns answers in a format that MCP clients can use, and it also supports the core MCP methods (&lt;code&gt;list_tools&lt;/code&gt;, &lt;code&gt;list_prompts&lt;/code&gt;, &lt;code&gt;call_tool&lt;/code&gt;, and &lt;code&gt;get_prompt&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/ask&lt;/code&gt; endpoint supports the following parameters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;query&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Natural language question&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;site&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scope to specific data subset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prev&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Comma-separated previous queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;decontextualized_query&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skip decontextualization if provided&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;streaming&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bool&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enable SSE streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;query_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Track conversation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;list&lt;/code&gt;, &lt;code&gt;summarize&lt;/code&gt;, or &lt;code&gt;generate&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
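&lt;p&gt;Several of these parameters can be combined in a single request. A sketch in Python's standard library (using the public Glama endpoint above; the network call itself is commented out so the snippet runs offline):&lt;/p&gt;

```python
# Assembling an /ask request that exercises several parameters from the table.
import json
from urllib import request

payload = {
    "query": "servers that can create PRs",
    "prev": "MCP servers for working with GitHub",  # comma-separated history
    "mode": "summarize",                            # list | summarize | generate
    "streaming": False,                             # disable SSE, get plain JSON
}
req = request.Request(
    "https://glama.ai/nlweb/ask",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with request.urlopen(req) as resp:  # uncomment to actually hit the endpoint
#     print(json.load(resp))
print(req.get_method())  # POST (urllib infers POST when data is set)
```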

&lt;h2&gt;
  
  
  Integrating with MCP
&lt;/h2&gt;

&lt;p&gt;Since NLWeb includes an MCP server by default, you can configure Claude for Desktop to talk to NLWeb.&lt;/p&gt;

&lt;p&gt;If you already have the NLWeb server running, this should be as simple as adding the following to your &lt;code&gt;~/Library/Application Support/Claude/claude_desktop_config.json&lt;/code&gt; configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ask_nlw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/Users/yourname/NLWeb/myenv/bin/python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"/Users/yourname/NLWeb/code/chatbot_interface.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"--server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"--endpoint"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"/mcp"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cwd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/Users/yourname/NLWeb/code"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementation Reality
&lt;/h2&gt;

&lt;p&gt;The documentation suggests you can get a basic prototype running quickly if you have existing Schema.org markup or RSS feeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's actually straightforward:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loading RSS feeds or Schema.org data&lt;/li&gt;
&lt;li&gt;Basic search functionality with provided prompts&lt;/li&gt;
&lt;li&gt;Local development with Qdrant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What requires more effort:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production deployment at scale
&lt;/li&gt;
&lt;li&gt;Optimizing 50+ LLM calls per query (mentioned in docs)&lt;/li&gt;
&lt;li&gt;Custom prompt engineering for your domain&lt;/li&gt;
&lt;li&gt;Maintaining data freshness between vector store and live data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I already had a lot of these components in place, so I was able to get a basic prototype running in an hour. However, to make this production-ready, I'd need to spend a lot more time thinking about the cost of the LLM calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should You Care?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Yes if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have structured data (Schema.org, RSS) already&lt;/li&gt;
&lt;li&gt;You want to enable conversational search beyond keywords&lt;/li&gt;
&lt;li&gt;You need programmatic AI agent access via MCP&lt;/li&gt;
&lt;li&gt;You can experiment with early-stage tech&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;No if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need battle-tested production code&lt;/li&gt;
&lt;li&gt;You can't handle significant LLM API costs&lt;/li&gt;
&lt;li&gt;Your content isn't well-structured&lt;/li&gt;
&lt;li&gt;You expect plug-and-play simplicity&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;NLWeb is more interesting as a strategic direction than as current technology. NLWeb was conceived and developed by R.V. Guha (creator of Schema.org, RSS, and RDF), now a CVP and Technical Fellow at Microsoft&lt;sup id="fnref3"&gt;3&lt;/sup&gt;. That's serious pedigree.&lt;/p&gt;

&lt;p&gt;The O'Reilly prototype proves it's viable for content-heavy sites. The quick start shows it's approachable for developers. But "prototype in days" doesn't mean "production in weeks."&lt;/p&gt;

&lt;p&gt;Think of it as an investment in making your content natively conversational. The technical foundation is solid—REST API, standard formats, proven vector stores. The vision is compelling. The code needs work.&lt;/p&gt;

&lt;p&gt;Want to experiment? Clone the repo and try the quick start above.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://news.microsoft.com/source/features/company-news/introducing-nlweb-bringing-conversational-interfaces-directly-to-the-web/" rel="noopener noreferrer"&gt;https://news.microsoft.com/source/features/company-news/introducing-nlweb-bringing-conversational-interfaces-directly-to-the-web/&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://github.com/microsoft/NLWeb" rel="noopener noreferrer"&gt;https://github.com/microsoft/NLWeb&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://techcommunity.microsoft.com/blog/azure-ai-services-blog/nlweb-pioneer-qa-oreilly/4415299" rel="noopener noreferrer"&gt;https://techcommunity.microsoft.com/blog/azure-ai-services-blog/nlweb-pioneer-qa-oreilly/4415299&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>microsoft</category>
      <category>seo</category>
      <category>ai</category>
    </item>
    <item>
      <title>Claude Sonnet and Opus 4 (Executive Summary)</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Thu, 22 May 2025 18:19:44 +0000</pubDate>
      <link>https://forem.com/punkpeye/claude-sonnet-and-opus-4-executive-summary-3h2j</link>
      <guid>https://forem.com/punkpeye/claude-sonnet-and-opus-4-executive-summary-3h2j</guid>
      <description>&lt;p&gt;Anthropic released Claude Opus 4 and Sonnet 4 today, claiming the #1 spot for coding performance. There are going to be a lot of articles floating around with exaggerations and marketing talk, but here is an executive summary of everything you need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Numbers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SWE-bench: 72.5% (world's best)&lt;/li&gt;
&lt;li&gt;Terminal-bench: 43.2% &lt;/li&gt;
&lt;li&gt;Sustained performance for hours on complex tasks&lt;/li&gt;
&lt;li&gt;$15/$75 per million tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Claude Sonnet 4:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SWE-bench: 72.7% (slightly ahead of Opus 4)&lt;/li&gt;
&lt;li&gt;3x faster than Opus 4 for most tasks&lt;/li&gt;
&lt;li&gt;$3/$15 per million tokens&lt;/li&gt;
&lt;/ul&gt;
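&lt;p&gt;The 5x price gap compounds quickly at agent-scale token volumes. A back-of-envelope check using the listed prices (the 2M/0.5M token workload is an arbitrary example, not a benchmark figure):&lt;/p&gt;

```python
# Cost comparison from the listed prices (input/output per million tokens):
# Opus 4 at $15/$75, Sonnet 4 at $3/$15.
def cost(tokens_in, tokens_out, price_in, price_out):
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

# Example workload: 2M input tokens, 0.5M output tokens.
opus = cost(2_000_000, 500_000, 15, 75)
sonnet = cost(2_000_000, 500_000, 3, 15)
print(opus, sonnet, opus / sonnet)  # 67.5 13.5 5.0
```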

&lt;p&gt;Two key slides from the announcement:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcq7rwl0lb8czdd0270wv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcq7rwl0lb8czdd0270wv.png" alt=" " width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmd5y7vc73o7sb0cnicc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmd5y7vc73o7sb0cnicc.png" alt=" " width="800" height="653"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Technical Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Architecture&lt;/strong&gt;: Instant responses + extended thinking mode (up to 64K tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extended Thinking with Tools&lt;/strong&gt;: Can use web search, code execution during reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Tool Execution&lt;/strong&gt;: Multiple tools simultaneously
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Files&lt;/strong&gt;: Creates persistent memory when given file access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;65% Reduction&lt;/strong&gt;: Less shortcut/loophole behavior than Sonnet 3.7&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Industry Adoption
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: Integrating Sonnet 4 into GitHub Copilot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cursor&lt;/strong&gt;: "State-of-the-art for coding"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rakuten&lt;/strong&gt;: Validated 7-hour autonomous refactor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sourcegraph&lt;/strong&gt;: "Substantial leap in software development"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  New API Capabilities
&lt;/h2&gt;

&lt;p&gt;Four &lt;a href="https://www.anthropic.com/news/agent-capabilities-api" rel="noopener noreferrer"&gt;new capabilities&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code execution tool&lt;/li&gt;
&lt;li&gt;MCP connector
&lt;/li&gt;
&lt;li&gt;Files API&lt;/li&gt;
&lt;li&gt;Prompt caching (1 hour)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Claude Code Generally Available
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;VS Code and JetBrains extensions (beta)&lt;/li&gt;
&lt;li&gt;GitHub Actions integration (&lt;a href="https://www.youtube.com/watch?v=L_WFEgry87M" rel="noopener noreferrer"&gt;demo&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Claude Code SDK for custom agents&lt;/li&gt;
&lt;li&gt;GitHub PR integration via &lt;code&gt;/install-github-app&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Access
&lt;/h2&gt;

&lt;p&gt;Already available via &lt;a href="https://www.anthropic.com/api" rel="noopener noreferrer"&gt;Anthropic API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to skip the new-model access restrictions, you can try them via &lt;a href="https://glama.ai/gateway" rel="noopener noreferrer"&gt;Glama Gateway&lt;/a&gt; and OpenRouter.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, is it hype?
&lt;/h2&gt;

&lt;p&gt;Claude 4 models lead coding benchmarks and offer sustained performance for complex agent workflows. Opus 4 delivers maximum capability; Sonnet 4 balances speed and cost. Both are already available to test.&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://www.anthropic.com/news/claude-4" rel="noopener noreferrer"&gt;Official Announcement&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'll update this article with interesting insights and facts as the day progresses.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Top 100 MCP searches Q1 2025</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Mon, 14 Apr 2025 08:31:53 +0000</pubDate>
      <link>https://forem.com/punkpeye/top-100-mcp-searches-q1-2025-56fm</link>
      <guid>https://forem.com/punkpeye/top-100-mcp-searches-q1-2025-56fm</guid>
      <description>&lt;p&gt;Data from &lt;a href="https://glama.ai/mcp/servers" rel="noopener noreferrer"&gt;https://glama.ai/mcp/servers&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;search&lt;/th&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=supabase" rel="noopener noreferrer"&gt;supabase&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2106&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=github" rel="noopener noreferrer"&gt;github&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1570&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=playwright" rel="noopener noreferrer"&gt;playwright&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1398&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=docker" rel="noopener noreferrer"&gt;docker&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1186&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=browser" rel="noopener noreferrer"&gt;browser&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1176&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=obsidian" rel="noopener noreferrer"&gt;obsidian&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;980&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=notion" rel="noopener noreferrer"&gt;notion&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;944&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=postgres" rel="noopener noreferrer"&gt;postgres&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;920&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=search" rel="noopener noreferrer"&gt;search&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;720&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=filesystem" rel="noopener noreferrer"&gt;filesystem&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;704&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=sequential" rel="noopener noreferrer"&gt;sequential&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;650&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=browser%20tools" rel="noopener noreferrer"&gt;browser tools&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;604&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=google" rel="noopener noreferrer"&gt;google&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;538&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=perplexity" rel="noopener noreferrer"&gt;perplexity&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;494&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=memory" rel="noopener noreferrer"&gt;memory&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;442&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=youtube" rel="noopener noreferrer"&gt;youtube&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;440&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=reddit" rel="noopener noreferrer"&gt;reddit&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;436&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=blender" rel="noopener noreferrer"&gt;blender&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;432&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=cursor" rel="noopener noreferrer"&gt;cursor&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;418&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=firebase" rel="noopener noreferrer"&gt;firebase&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=wordpress" rel="noopener noreferrer"&gt;wordpress&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;390&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=firecrawl" rel="noopener noreferrer"&gt;firecrawl&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;370&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=sequential%20thinking" rel="noopener noreferrer"&gt;sequential thinking&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;366&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=weather" rel="noopener noreferrer"&gt;weather&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;356&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=confluence" rel="noopener noreferrer"&gt;confluence&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;334&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=linear" rel="noopener noreferrer"&gt;linear&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;322&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=puppeteer" rel="noopener noreferrer"&gt;puppeteer&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;318&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=salesforce" rel="noopener noreferrer"&gt;salesforce&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=postgresql" rel="noopener noreferrer"&gt;postgresql&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=linkedin" rel="noopener noreferrer"&gt;linkedin&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;296&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=gitlab" rel="noopener noreferrer"&gt;gitlab&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;294&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=thinking" rel="noopener noreferrer"&gt;thinking&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;286&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=spotify" rel="noopener noreferrer"&gt;spotify&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;284&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=kubernetes" rel="noopener noreferrer"&gt;kubernetes&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;264&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=shopify" rel="noopener noreferrer"&gt;shopify&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;260&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=flutter" rel="noopener noreferrer"&gt;flutter&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;258&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=google%20drive" rel="noopener noreferrer"&gt;google drive&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;254&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=calendar" rel="noopener noreferrer"&gt;calendar&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=atlassian" rel="noopener noreferrer"&gt;atlassian&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;246&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=sqlite" rel="noopener noreferrer"&gt;sqlite&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;228&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=whatsapp" rel="noopener noreferrer"&gt;whatsapp&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;224&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=twitter" rel="noopener noreferrer"&gt;twitter&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;222&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=Supabase" rel="noopener noreferrer"&gt;Supabase&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;218&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=google%20calendar" rel="noopener noreferrer"&gt;google calendar&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;212&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=documentation" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;206&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=browsertools" rel="noopener noreferrer"&gt;browsertools&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;202&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=terminal" rel="noopener noreferrer"&gt;terminal&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;198&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=python" rel="noopener noreferrer"&gt;python&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;198&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=Sequential%20Thinking" rel="noopener noreferrer"&gt;Sequential Thinking&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;194&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=chrome" rel="noopener noreferrer"&gt;chrome&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=openapi" rel="noopener noreferrer"&gt;openapi&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=telegram" rel="noopener noreferrer"&gt;telegram&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;174&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=tavily" rel="noopener noreferrer"&gt;tavily&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;174&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=file%20system" rel="noopener noreferrer"&gt;file system&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;172&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=cloudflare" rel="noopener noreferrer"&gt;cloudflare&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;170&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=discord" rel="noopener noreferrer"&gt;discord&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;164&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=browser%20use" rel="noopener noreferrer"&gt;browser use&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;158&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=instagram" rel="noopener noreferrer"&gt;instagram&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;156&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=Playwright" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;154&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=datadog" rel="noopener noreferrer"&gt;datadog&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;152&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=crypto" rel="noopener noreferrer"&gt;crypto&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;152&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=fetch" rel="noopener noreferrer"&gt;fetch&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=laravel" rel="noopener noreferrer"&gt;laravel&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=mongodb" rel="noopener noreferrer"&gt;mongodb&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;148&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=outlook" rel="noopener noreferrer"&gt;outlook&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;146&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=airtable" rel="noopener noreferrer"&gt;airtable&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;140&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=snowflake" rel="noopener noreferrer"&gt;snowflake&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;138&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=postgre" rel="noopener noreferrer"&gt;postgre&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;138&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=database" rel="noopener noreferrer"&gt;database&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;136&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=markdown" rel="noopener noreferrer"&gt;markdown&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;136&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=web%20search" rel="noopener noreferrer"&gt;web search&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;136&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=browser%20tool" rel="noopener noreferrer"&gt;browser tool&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;134&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=stripe" rel="noopener noreferrer"&gt;stripe&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=sql%20server" rel="noopener noreferrer"&gt;sql server&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;126&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=hubspot" rel="noopener noreferrer"&gt;hubspot&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;126&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=powerpoint" rel="noopener noreferrer"&gt;powerpoint&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;126&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=ollama" rel="noopener noreferrer"&gt;ollama&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=bitbucket" rel="noopener noreferrer"&gt;bitbucket&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=microsoft" rel="noopener noreferrer"&gt;microsoft&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;122&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=clickhouse" rel="noopener noreferrer"&gt;clickhouse&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;122&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=clickup" rel="noopener noreferrer"&gt;clickup&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=browser" rel="noopener noreferrer"&gt;browser&lt;/a&gt; tools&lt;/td&gt;
&lt;td&gt;118&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=claude" rel="noopener noreferrer"&gt;claude&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;118&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=deepseek" rel="noopener noreferrer"&gt;deepseek&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;116&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=Notion" rel="noopener noreferrer"&gt;Notion&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;116&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=Obsidian" rel="noopener noreferrer"&gt;Obsidian&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;114&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=docker%20mcp" rel="noopener noreferrer"&gt;docker mcp&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=mcp%20server%20fetch" rel="noopener noreferrer"&gt;mcp server fetch&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=postman" rel="noopener noreferrer"&gt;postman&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=terraform" rel="noopener noreferrer"&gt;terraform&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=vercel" rel="noopener noreferrer"&gt;vercel&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=replicate" rel="noopener noreferrer"&gt;replicate&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=research" rel="noopener noreferrer"&gt;research&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=brave%20search" rel="noopener noreferrer"&gt;brave search&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=Documentation" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=trello" rel="noopener noreferrer"&gt;trello&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=oracle" rel="noopener noreferrer"&gt;oracle&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;106&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=todoist" rel="noopener noreferrer"&gt;todoist&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;106&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=android" rel="noopener noreferrer"&gt;android&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;106&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://glama.ai/mcp/servers?query=screenshot" rel="noopener noreferrer"&gt;screenshot&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;104&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
    </item>
    <item>
      <title>GPT-4.5 Announced: How to Access the Latest OpenAI Model Without Rate Limits</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Thu, 27 Feb 2025 21:49:12 +0000</pubDate>
      <link>https://forem.com/punkpeye/gpt-45-announced-how-to-access-the-latest-openai-model-without-rate-limits-202l</link>
      <guid>https://forem.com/punkpeye/gpt-45-announced-how-to-access-the-latest-openai-model-without-rate-limits-202l</guid>
      <description>&lt;p&gt;OpenAI's latest research preview, GPT-4.5, marks a substantial leap forward in unsupervised learning, combining broader world knowledge, greater emotional intelligence ("EQ"), and vastly improved intuition to support natural and nuanced conversation. Designed with innovative scalable training techniques, GPT-4.5 excels in creative tasks, coding workflows, and meaningful interactions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4djepgxribv1pgnyk45i.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4djepgxribv1pgnyk45i.jpeg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Available Servers to Access GPT-4.5
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI Official: &lt;a href="https://openai.com/api/" rel="noopener noreferrer"&gt;https://openai.com/api/&lt;/a&gt; — The official platform, but subject to rate limits.&lt;/li&gt;
&lt;li&gt;Glama AI: &lt;a href="https://glama.ai/models/gpt-4.5-preview-2025-02-27" rel="noopener noreferrer"&gt;GPT-4.5 Preview&lt;/a&gt; — Offers GPT-4.5 without rate limits, easy sign-up within 30 seconds.&lt;/li&gt;
&lt;li&gt;OpenRouter: &lt;a href="https://openrouter.ai/openai/gpt-4.5-preview" rel="noopener noreferrer"&gt;GPT-4.5&lt;/a&gt; — Another reliable platform without rate restrictions providing easy access.&lt;/li&gt;
&lt;/ul&gt;
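All three providers above expose an OpenAI-compatible chat-completions API, so the same request shape works against any of them. The sketch below builds such a request with only the standard library; the OpenRouter base URL and the `openai/gpt-4.5-preview` model slug are taken from the links above, but treat the exact values as assumptions and confirm them in each provider's docs.

```python
# Minimal sketch: build a chat-completions request for an
# OpenAI-compatible endpoint. Base URL and model slug are assumptions;
# substitute the values from whichever provider you pick above.
import json
import urllib.request

def build_chat_request(base_url, api_key, model, prompt):
    """Construct (but do not send) a chat-completions HTTP request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request(
    "https://openrouter.ai/api/v1",  # or another provider's base URL
    "YOUR_API_KEY",                  # placeholder
    "openai/gpt-4.5-preview",
    "Summarize the GPT-4.5 announcement in one sentence.",
)
# urllib.request.urlopen(req) would send it and return the JSON response.
```

Sending the request returns a JSON body whose `choices[0].message.content` holds the reply, assuming the provider follows the standard OpenAI response shape.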

&lt;h2&gt;
  
  
  Key Highlights from Official OpenAI Announcement
&lt;/h2&gt;

&lt;p&gt;OpenAI describes GPT-4.5 as a major step forward in scaling unsupervised learning. Surpassing previous generations in intuitive world knowledge, GPT-4.5 shows significantly lower hallucination rates and far higher accuracy on general knowledge queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improved Natural Interactions
&lt;/h3&gt;

&lt;p&gt;Early feedback indicates that GPT-4.5 interactions feel notably smoother and more organic, with stronger emotional intelligence (“EQ”) enabling more nuanced conversational exchanges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enhanced Capabilities at a Glance:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Higher factual accuracy across diverse topics (62.5% SimpleQA accuracy vs GPT-4o’s 38.2%).&lt;/li&gt;
&lt;li&gt;Reduced hallucination rate (37.1% vs GPT-4o’s 61.8%).&lt;/li&gt;
&lt;li&gt;Better collaborative intelligence, intuitive interactions, and creativity.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Technical Advances:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Marked improvement in unsupervised learning, enhancing intuition and broadening world-model accuracy.&lt;/li&gt;
&lt;li&gt;New scalable training techniques enable GPT-4.5 to better understand and anticipate human intent, resulting in unprecedented conversational fluidity.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Immediate Use Cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Creative and Writing Assistance: Exceptional at creative writing, content editing, and design tasks.&lt;/li&gt;
&lt;li&gt;Programming and Automation Workflows: Superb capabilities in executing complex coding and multi-step automation.&lt;/li&gt;
&lt;li&gt;Interpersonal Communication: Stronger emotional intelligence makes GPT-4.5 ideal for empathetic engagements, coaching scenarios, and even mental wellness interactions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Access GPT-4.5 Now
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI Official: &lt;a href="https://openai.com/api/" rel="noopener noreferrer"&gt;https://openai.com/api/&lt;/a&gt; — The official platform, but subject to rate limits.&lt;/li&gt;
&lt;li&gt;Glama AI: &lt;a href="https://glama.ai/models/gpt-4.5-preview-2025-02-27" rel="noopener noreferrer"&gt;GPT-4.5 Preview&lt;/a&gt; — Offers GPT-4.5 without rate limits, easy sign-up within 30 seconds.&lt;/li&gt;
&lt;li&gt;OpenRouter: &lt;a href="https://openrouter.ai/openai/gpt-4.5-preview" rel="noopener noreferrer"&gt;GPT-4.5&lt;/a&gt; — Another reliable platform without rate restrictions providing easy access.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>How to access DeepSeek r1?</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Wed, 29 Jan 2025 18:25:59 +0000</pubDate>
      <link>https://forem.com/punkpeye/how-to-access-deepseek-r1-44na</link>
      <guid>https://forem.com/punkpeye/how-to-access-deepseek-r1-44na</guid>
      <description>&lt;p&gt;Many people are struggling to get access to DeepSeek r1 at the moment because of the rate limits and restricted sign ups. However, there are alternative providers that you can use to access DeepSeek r1 and its distill models.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;deepseek-r1-distill-qwen-32b&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s start with &lt;code&gt;deepseek-r1-distill-qwen-32b&lt;/code&gt; because it is the easiest model to get access to and probably the best balance of cost, performance and speed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;deepseek-r1-distill-qwen-32b&lt;/code&gt; is a distilled version of r1. This model is made by transferring the knowledge from the larger model to the smaller model through a process known as &lt;a href="https://en.wikipedia.org/wiki/Knowledge_distillation" rel="noopener noreferrer"&gt;knowledge distillation&lt;/a&gt;. The 32b qwen model in particular beats other models in several benchmarks, esp. &lt;a href="https://github.com/deepseek-ai/DeepSeek-R1?tab=readme-ov-file#4-evaluation-results" rel="noopener noreferrer"&gt;coding&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There is only one provider that currently makes this model available for anyone to use: &lt;a href="https://glama.ai/gateway" rel="noopener noreferrer"&gt;Glama Gateway&lt;/a&gt;. Alternatively, you can self-host this model, but expect to need approx. 80 GB of VRAM.&lt;/p&gt;

&lt;p&gt;Providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://glama.ai/models/deepseek-r1-distill-qwen-32b" rel="noopener noreferrer"&gt;https://glama.ai/models/deepseek-r1-distill-qwen-32b&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The great thing about 32b is price and response times. It is currently cheaper than the official DeepSeek r1 and responds slightly faster than r1.&lt;/p&gt;

&lt;h2&gt;
  
  
  deepseek-r1-distill-llama-70b
&lt;/h2&gt;

&lt;p&gt;The 70b llama version is also a distilled version of DeepSeek r1. It is based on llama, meaning that there are more providers available.&lt;/p&gt;

&lt;p&gt;Groq is one of the noteworthy providers that offer this model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://console.groq.com/docs/models" rel="noopener noreferrer"&gt;https://console.groq.com/docs/models&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The benefit of using Groq is that it is extremely fast (upwards of 300 tokens per second for this model).&lt;/p&gt;

&lt;p&gt;The downside is that the model is severely rate limited. Depending on what you are planning to do, the current rate limits (30k tokens per minute) might not be enough.&lt;/p&gt;

&lt;p&gt;You can also access this model through Glama — &lt;a href="https://glama.ai/models/deepseek-r1-distill-llama-70b" rel="noopener noreferrer"&gt;deepseek-r1-distill-llama-70b&lt;/a&gt;. As a gateway provider, Glama has slightly elevated rate limits and can offer up to 60k tokens per minute.&lt;/p&gt;
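To make those tokens-per-minute caps concrete, you can compute the minimum spacing between requests that keeps sustained usage under a given limit. This is a generic sketch; the 30k and 60k figures come from the limits quoted above and may change.

```python
# Sketch: pace requests to stay under a tokens-per-minute (TPM) cap.
# The TPM figures below come from the limits quoted in the article
# and may change; check your provider's current quotas.
def min_seconds_between_requests(tokens_per_request, tpm_limit):
    """Minimum gap between requests so sustained usage stays under the cap."""
    return 60.0 * tokens_per_request / tpm_limit

# At roughly 15k tokens per request:
groq_gap = min_seconds_between_requests(15_000, 30_000)   # under a 30k TPM limit
glama_gap = min_seconds_between_requests(15_000, 60_000)  # under a 60k TPM limit
```

Doubling the TPM budget halves the required gap, which is the practical difference between the two limits discussed above.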

&lt;p&gt;Other providers to evaluate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://deepinfra.com/deepseek-ai/DeepSeek-R1-Distill-Llama-70B" rel="noopener noreferrer"&gt;https://deepinfra.com/deepseek-ai/DeepSeek-R1-Distill-Llama-70B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://novita.ai/models/llm/deepseek-deepseek-r1-distill-llama-70b" rel="noopener noreferrer"&gt;https://novita.ai/models/llm/deepseek-deepseek-r1-distill-llama-70b&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I will update this article as I discover other providers.&lt;/p&gt;

&lt;p&gt;If you were planning to host this model yourself, bear in mind that it requires a lot of VRAM (approx. 140 GB). While it is possible to host it on a lower-spec machine, the performance will be subpar.&lt;/p&gt;

&lt;h2&gt;
  
  
  DeepSeek r1
&lt;/h2&gt;

&lt;p&gt;Finally, if you are trying to get &lt;code&gt;deepseek-r1&lt;/code&gt;, your best bet remains waiting for deepseek.com to clear out the backlog of demand. Allegedly, they are currently experiencing a DDoS attack, so new user sign-ups are currently restricted.&lt;/p&gt;

&lt;p&gt;A few other providers that claim to offer r1:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://novita.ai/models/llm/deepseek-deepseek-r1" rel="noopener noreferrer"&gt;https://novita.ai/models/llm/deepseek-deepseek-r1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fireworks.ai/models/fireworks/deepseek-r1" rel="noopener noreferrer"&gt;https://fireworks.ai/models/fireworks/deepseek-r1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I explicitly say “claim to offer” because many of them are oversubscribed at the moment and unable to meet demand. Even if you sign up, you might still hit rate limits.&lt;/p&gt;

&lt;p&gt;Unfortunately, hosting r1 yourself is not a viable option for most of us. It is a 671B-parameter model, meaning that you would need at least 1,342 GB of VRAM to host it, which is beyond the reach of any home user.&lt;/p&gt;
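The VRAM figures in this article (approx. 140 GB for the 70B distill, 1,342 GB for r1) follow from a simple rule of thumb: at fp16, every parameter takes two bytes, so the weights alone need about 2 GB per billion parameters, before KV cache and activation overhead.

```python
# Rule of thumb: fp16 weights take 2 bytes per parameter,
# i.e. roughly 2 GB of VRAM per billion parameters.
# This covers weights only; KV cache and activations add more on top,
# and quantized formats (8-bit, 4-bit) proportionally reduce it.
def fp16_weight_gb(params_billions):
    """Approximate VRAM (GB) needed for fp16 weights alone."""
    return params_billions * 2

print(fp16_weight_gb(70))   # 140 GB for the 70B distill
print(fp16_weight_gb(671))  # 1342 GB for the full r1
```

This also explains why quantized builds are the only realistic path to running the larger distills on consumer hardware.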

&lt;p&gt;If you become aware of other providers, please leave a comment and I will add them to the list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Distill Models
&lt;/h2&gt;

&lt;p&gt;There are many other &lt;a href="https://github.com/deepseek-ai/DeepSeek-R1" rel="noopener noreferrer"&gt;distilled versions available&lt;/a&gt;. If your goal is to run the model locally, you should evaluate them based on the benchmarks in that GitHub repository. Some small models (like the 1.5B and 7B) can reasonably be run on your local machine and perform decently well.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deepseek</category>
    </item>
    <item>
      <title>Building ask-PDF service using LLMs</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Sun, 17 Nov 2024 17:58:32 +0000</pubDate>
      <link>https://forem.com/punkpeye/building-ask-pdf-service-using-llms-2km9</link>
      <guid>https://forem.com/punkpeye/building-ask-pdf-service-using-llms-2km9</guid>
      <description>&lt;p&gt;I wanted to add a feature to &lt;a href="https://glama.ai" rel="noopener noreferrer"&gt;Glama&lt;/a&gt; that allows users to upload documents and ask questions about them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcviip1rbt34g4m1oxqrq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcviip1rbt34g4m1oxqrq.gif" alt=" " width="760" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I've built similar features before, but they were always domain specific. For example, looking up recipes, searching for products, etc. A generalized solution posed a few unexpected challenges: converting documents to markdown, splitting them, indexing them, and retrieving them all turned out to be quite complex.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk through the strategy of splitting documents into smaller chunks, since this took me a while to figure out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When you have a domain-specific RAG, it is typically easy to just create a dedicated record for every entity in the domain. For example, if you are building a recipe RAG, you might have a record for each recipe, ingredient, and step. You don't have to worry about splitting the document into chunks, since you already know the semantic structure of the document.&lt;/p&gt;

&lt;p&gt;However, when you have a generalized RAG, your input is just a document. Any document. Even when you convert the document to markdown (which has some structure), you still have to figure out how to split it into context-aware chunks.&lt;/p&gt;

&lt;p&gt;Suppose a user uploads a document like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Recipe Book&lt;/span&gt;

&lt;span class="gu"&gt;## Recipe 1&lt;/span&gt;

Name: Chocolate Chip Cookies

&lt;span class="gu"&gt;### Ingredients&lt;/span&gt;
&lt;span class="p"&gt;
*&lt;/span&gt; 2 cups all-purpose flour
&lt;span class="p"&gt;*&lt;/span&gt; 1 cup granulated sugar
&lt;span class="p"&gt;*&lt;/span&gt; 1 cup unsalted butter, at room temperature
&lt;span class="p"&gt;*&lt;/span&gt; 1 cup light brown sugar, packed
&lt;span class="p"&gt;*&lt;/span&gt; 2 large eggs
&lt;span class="p"&gt;*&lt;/span&gt; 2 teaspoons vanilla extract
&lt;span class="p"&gt;*&lt;/span&gt; 2 cups semi-sweet chocolate chips

&lt;span class="gu"&gt;### Instructions&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; Preheat oven to 350°F (180°C). Line a baking sheet with parchment paper.
&lt;span class="p"&gt;2.&lt;/span&gt; In a medium bowl, whisk together flour, sugar, and butter.
&lt;span class="p"&gt;3.&lt;/span&gt; In a large bowl, beat the egg yolks and the egg whites together.
&lt;span class="p"&gt;4.&lt;/span&gt; Stir in the vanilla.
&lt;span class="p"&gt;5.&lt;/span&gt; Gradually stir in the flour mixture until a dough forms.
&lt;span class="p"&gt;6.&lt;/span&gt; Fold in the chocolate chips.
&lt;span class="p"&gt;7.&lt;/span&gt; Drop the dough by rounded tablespoons onto the prepared baking sheet.
&lt;span class="p"&gt;8.&lt;/span&gt; Bake for 8-10 minutes, or until the edges are golden brown.
&lt;span class="p"&gt;9.&lt;/span&gt; Let cool for a few minutes before transferring to a wire rack to cool completely.

&lt;span class="gu"&gt;## Recipe 2 ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we knew it is a recipe book, we could just split the document into chunks based on the &lt;code&gt;## Recipe 1&lt;/code&gt; and &lt;code&gt;## Recipe 2&lt;/code&gt; headers. However, since we don't know the structure of the document, we can't just split it based on headers.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If we split too high (&lt;code&gt;h2&lt;/code&gt;), we might end up with chunks that are too large&lt;/li&gt;
&lt;li&gt;If we split too low (&lt;code&gt;h3&lt;/code&gt;), we might end up with many small chunks that lack the context needed to answer the question&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So we need to split the document such that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each chunk has useful embeddings&lt;/li&gt;
&lt;li&gt;Each retrieved chunk carries sufficient context to answer the question&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sounds like an impossible task, right? Well, it is. But I found a solution that works pretty well.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;The solution is a combination of several techniques.&lt;/p&gt;

&lt;h3&gt;
  
  
  Splitting
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Parsing the document into a tree structure&lt;/li&gt;
&lt;li&gt;Splitting each node in the tree into semantically meaningful chunks&lt;/li&gt;
&lt;/ol&gt;
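As a minimal sketch of step 1 (assuming plain ATX `#` headings and no fenced code blocks, whose `#` lines would confuse it; a real implementation should use a proper markdown parser), the tree can be built with a stack keyed on heading depth:

```python
# Sketch of step 1: parse markdown into a tree of sections.
# Assumes ATX "#" headings and no fenced code blocks; a production
# version should walk a real markdown AST instead of raw lines.
def parse_tree(markdown):
    root = {"children": [], "content": "", "heading": None}
    stack = [(0, root)]  # (heading depth, node)
    for line in markdown.splitlines(keepends=True):
        if line.startswith("#"):
            depth = len(line) - len(line.lstrip("#"))
            node = {
                "children": [],
                "content": line,
                "heading": {"depth": depth, "title": line.strip("# \n")},
            }
            # Unwind to the nearest shallower section, then attach.
            while stack[-1][0] >= depth:
                stack.pop()
            stack[-1][1]["children"].append(node)
            stack.append((depth, node))
        else:
            # Non-heading lines accumulate on the current section.
            stack[-1][1]["content"] += line
    return root
```

Run on the recipe document, this produces the nested shape shown below, with the Ingredients and Instructions sections hanging off their parent recipe.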

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Using our example document, the tree structure would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"children"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"children"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"children"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"### Ingredients&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;* 2 cups all-purpose flour&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* 1 cup granulated sugar&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* 1 cup unsalted butter, at room temperature&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* 1 cup light brown sugar, packed&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* 2 large eggs&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* 2 teaspoons vanilla extract&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;* 2 cups semi-sweet chocolate chips&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"heading"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ingredients"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"children"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"### Instructions&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;1. Preheat oven to 350°F (180°C). Line a baking sheet with parchment paper.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;2. In a medium bowl, whisk together flour, sugar, and butter.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;3. In a large bowl, beat the egg yolks and the egg whites together.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;4. Stir in the vanilla.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;5. Gradually stir in the flour mixture until a dough forms.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;6. Fold in the chocolate chips.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;7. Drop the dough by rounded tablespoons onto the prepared baking sheet.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;8. Bake for 8-10 minutes, or until the edges are golden brown.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;9. Let cool for a few minutes before transferring to a wire rack to cool completely.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"heading"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Instructions"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"## Recipe 1&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Name: Chocolate Chip Cookies&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"heading"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Recipe 1"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"children"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"## Recipe 2 ...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"heading"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Recipe 2"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"heading"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Recipe Book"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The benefit of this structure is that we can now store these sections in a database while retaining their hierarchical structure. Here is the database schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;                              Table "public.document_section"
           Column           |  Type   | Collation | Nullable |           Default
----------------------------+---------+-----------+----------+------------------------------
 id                         | integer |           | not null | generated always as identity
 uploaded_document_id       | integer |           | not null |
 parent_document_section_id | integer |           |          |
 heading_title              | text    |           | not null |
 heading_depth              | integer |           | not null |
 content                    | text    |           |          |
 sequence_number            | integer |           | not null |
 path                       | ltree   |           | not null |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;[!NOTE]&lt;br&gt;
The &lt;code&gt;path&lt;/code&gt; column is a PostgreSQL &lt;a href="https://www.postgresql.org/docs/current/ltree.html" rel="noopener noreferrer"&gt;ltree&lt;/a&gt; column that allows us to store the hierarchical structure of the document. This is useful for querying later on.&lt;/p&gt;
&lt;/blockquote&gt;
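&lt;p&gt;With &lt;code&gt;ltree&lt;/code&gt; paths in place, fetching a section together with all of its descendants becomes a single query. A minimal sketch (the &lt;code&gt;id&lt;/code&gt; value is hypothetical):&lt;/p&gt;

```sql
-- Sketch: fetch a section and its entire subtree.
-- "<@" is ltree's "is a descendant of (or equal to)" operator.
SELECT id, heading_title, heading_depth, path
FROM document_section
WHERE path <@ (
  SELECT path
  FROM document_section
  WHERE id = 42 -- hypothetical section id
)
ORDER BY path;
```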

&lt;p&gt;However, this alone is not enough. Since each section can be arbitrarily long, we need to split sections into smaller chunks. This also allows us to create more granular embeddings for each chunk.&lt;/p&gt;

&lt;p&gt;I ended up using &lt;a href="https://github.com/syntax-tree/mdast" rel="noopener noreferrer"&gt;&lt;code&gt;mdast&lt;/code&gt;&lt;/a&gt; to split each section into chunks between 1000 and 2000 characters. I made exceptions for tables, code blocks, blockquotes, and lists.&lt;/p&gt;
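&lt;p&gt;The chunking step can be sketched as follows. This is a simplified illustration, not the actual &lt;code&gt;mdast&lt;/code&gt;-based implementation: it splits on blank lines instead of walking the syntax tree, and it ignores the exceptions for tables, code blocks, blockquotes, and lists:&lt;/p&gt;

```typescript
// Simplified sketch: greedily pack markdown blocks into chunks,
// flushing a chunk once adding the next block would exceed the
// upper bound (~2000 characters).
const MAX_CHUNK_LENGTH = 2000;

const chunkSection = (content: string): string[] => {
  const blocks = content.split(/\n\n+/);
  const chunks: string[] = [];
  let current = '';

  for (const block of blocks) {
    const candidate = current ? current + '\n\n' + block : block;

    if (current && candidate.length > MAX_CHUNK_LENGTH) {
      // Adding this block would overflow the chunk; start a new one.
      chunks.push(current);
      current = block;
    } else {
      current = candidate;
    }
  }

  if (current) {
    chunks.push(current);
  }

  return chunks;
};
```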

&lt;p&gt;Here is the resulting database schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;                          Table "public.document_section_chunk"
       Column        |     Type     | Collation | Nullable |           Default
---------------------+--------------+-----------+----------+------------------------------
 id                  | integer      |           | not null | generated always as identity
 document_section_id | integer      |           | not null |
 chunk_index         | integer      |           | not null |
 content             | text         |           | not null |
 embedding           | vector(1024) |           | not null |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;embedding&lt;/code&gt; column uses the PostgreSQL &lt;code&gt;vector&lt;/code&gt; type and stores the embedding of the chunk. I used &lt;code&gt;jina-embeddings-v3&lt;/code&gt; to create the embeddings. I picked a model that scores relatively well on the &lt;a href="https://huggingface.co/spaces/mteb/leaderboard" rel="noopener noreferrer"&gt;MTEB&lt;/a&gt; leaderboard while remaining relatively light on memory usage.&lt;/p&gt;

&lt;p&gt;Okay, so now we have a database that stores the document sections and their embeddings. The next step is to create a RAG pipeline that can retrieve the relevant sections/chunks for a given question.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval
&lt;/h3&gt;

&lt;p&gt;Retrieval is the process of finding the relevant chunks for a given question.&lt;/p&gt;

&lt;p&gt;My process was to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use LLMs to generate several search queries &lt;em&gt;based&lt;/em&gt; on the user's input. For example, if the user asks "What is the recipe for chocolate chip cookies?", the LLMs generate queries that break the question into smaller parts, e.g. "chocolate chip cookies ingredients", "chocolate chip cookies instructions", etc.&lt;/li&gt;
&lt;li&gt;Query the database to find the top &lt;code&gt;N&lt;/code&gt; chunks that match the generated queries.&lt;/li&gt;
&lt;li&gt;Use the &lt;code&gt;document_section_chunk&lt;/code&gt; and &lt;code&gt;document_section&lt;/code&gt; relationship to identify which sections the chunks belong to, and which sections are referenced most frequently by the matching chunks.&lt;/li&gt;
&lt;/ol&gt;
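&lt;p&gt;Step 2 can be expressed as one query per generated search query. A hypothetical sketch, assuming the &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;pgvector&lt;/a&gt; extension provides the vector type, where &lt;code&gt;$1&lt;/code&gt; is the embedding of the generated query:&lt;/p&gt;

```sql
-- Sketch: top 10 chunks closest to the query embedding
-- ("<=>" is pgvector's cosine-distance operator).
SELECT
  id,
  document_section_id,
  embedding <=> $1 AS cosine_distance
FROM document_section_chunk
ORDER BY embedding <=> $1
LIMIT 10;
```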

&lt;p&gt;At this point, we know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;which chunks are the most relevant to the question&lt;/li&gt;
&lt;li&gt;which sections are the most relevant to the question&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The most relevant sections/chunks are determined by ordering the candidates by &lt;em&gt;cosine distance&lt;/em&gt; between their embeddings and the query embedding.&lt;/p&gt;

&lt;p&gt;However, we don't know which sections/chunks can actually be used to answer the question; just because a chunk has a low cosine distance to the question does not mean that the chunk answers it. For this step, I ended up using another LLM prompt: it is given the question and the candidate chunks, and asks the LLM to rank the chunks by how well they answer the question.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[!NOTE]&lt;br&gt;
I later learned that Jina has a &lt;a href="https://jina.ai/reranker/" rel="noopener noreferrer"&gt;Reranker&lt;/a&gt; API that does essentially the same thing. I compared the two approaches and found that both solutions perform equally well. However, if you prefer a higher level of abstraction, Reranker is a good choice.&lt;/p&gt;
&lt;/blockquote&gt;
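&lt;p&gt;To give a flavor of the ranking step, here is a minimal sketch of such a prompt (not the exact wording I used):&lt;/p&gt;

```text
You are given a question and a list of candidate chunks.

Question:

[question]

Candidate chunks:

[chunks]

Score each chunk from 0 (irrelevant to the question) to 10 (fully
answers the question). Respond with a JSON array of objects with
"chunkId" and "score" properties, ordered from highest to lowest score.
```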

&lt;p&gt;Finally, I have a handful of sections/chunks that answer the question. The last step is to determine which sections/chunks to include in the final answer. I do this by assigning a finite budget to each question (e.g. 1000 tokens), and then prioritizing the most relevant sections/chunks until the budget is exhausted. The reason sections and chunks are kept separate is that sometimes a single section answers the whole question and fits in the budget, while other times we need to include the more granular chunks instead.&lt;/p&gt;
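&lt;p&gt;The budgeting step can be sketched as a greedy selection. The names and the &lt;code&gt;tokens&lt;/code&gt; field are hypothetical; in practice the token counts would come from a tokenizer:&lt;/p&gt;

```typescript
// Sketch: pick the highest-relevance candidates that still fit
// within the token budget.
type Candidate = {
  content: string;
  relevance: number; // higher is better
  tokens: number;
};

const selectWithinBudget = (
  candidates: Candidate[],
  budget: number,
): Candidate[] => {
  const ranked = [...candidates].sort((a, b) => b.relevance - a.relevance);
  const selected: Candidate[] = [];
  let used = 0;

  for (const candidate of ranked) {
    if (used + candidate.tokens <= budget) {
      selected.push(candidate);
      used += candidate.tokens;
    }
  }

  return selected;
};
```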

&lt;h2&gt;
  
  
  Further Improvements
&lt;/h2&gt;

&lt;p&gt;As I started typing this post, I realized that there are too many subtle details to cover; mentioning them all would make the post far too long.&lt;/p&gt;

&lt;p&gt;A few things I want to mention that helped me improve the solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I use a simple LLM to generate a brief description of each section. I then create embeddings for those descriptions and use them as part of the logic used to determine which sections to include in the answer.&lt;/li&gt;
&lt;li&gt;I include meta information about each section in the generated answer. For example, the section title, depth, and the surrounding section names.&lt;/li&gt;
&lt;li&gt;I provide multiple tools to the LLMs to help answer the question, e.g. a tool to look up all mentions of a term in the document, a tool to look up the next section in the document, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, I think the biggest innovation of this approach is splitting markdown documents into a hierarchical structure, and then splitting each section into smaller chunks. This makes it possible to build a generalized RAG that can answer questions about any markdown document.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Implementing Tool Functionality in Conversational AI</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Thu, 17 Oct 2024 23:32:49 +0000</pubDate>
      <link>https://forem.com/punkpeye/implementing-tool-functionality-in-conversational-ai-3n2g</link>
      <guid>https://forem.com/punkpeye/implementing-tool-functionality-in-conversational-ai-3n2g</guid>
      <description>&lt;p&gt;As part of building &lt;a href="https://glama.ai/" rel="noopener noreferrer"&gt;Glama&lt;/a&gt;, I am trying to build a deeper understanding of the concepts behind existing services, such as OpenAI's &lt;a href="https://platform.openai.com/docs/assistants/tools" rel="noopener noreferrer"&gt;assistant tools&lt;/a&gt;. So I decided to write a small PoC that attempts to replicate the functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are assistant tools?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://jrmyphlmn.com/posts/sequential-function-calls" rel="noopener noreferrer"&gt;This blog post&lt;/a&gt; captures well the concepts behind the tools. In short, the tools are a way to define a set of functions that can be called by the model in response to user queries. Furthermore, the model can call multiple functions in sequence to answer complex queries. It can deduce the correct order of function calls to complete a task, eliminating the need for complex routing logic.&lt;/p&gt;

&lt;p&gt;Practical examples of tools include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetching information from external sources (e.g. fetching the current weather in a given location)&lt;/li&gt;
&lt;li&gt;Calculating complex mathematical expressions (e.g. calculating the total cost of a shopping cart)&lt;/li&gt;
&lt;li&gt;Performing actions on the user's behalf (e.g. sending an email)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, not all models support tools. I wanted to write my own routing implementation so that I could enable access to the tools for all models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing tools
&lt;/h2&gt;

&lt;p&gt;I started with writing a simple test case that describes the happy path.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;routeMessage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./routeMessage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;it&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;vitest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;uses tools if relevant tools are available&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;routeMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;What is 2+2?&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Adds two numbers.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;addNumbers&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
          &lt;span class="na"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="na"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toEqual&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;addNumbers&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expectation is that the &lt;code&gt;routeMessage&lt;/code&gt; function will understand that the user is asking for the sum of two numbers, and will plan a call to the &lt;code&gt;addNumbers&lt;/code&gt; tool to get the result.&lt;/p&gt;

&lt;p&gt;In order to do this, we need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Describe the tools available to the model&lt;/li&gt;
&lt;li&gt;Provide the model with the conversation history&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Describing tools
&lt;/h3&gt;

&lt;p&gt;Tool descriptions need to be expressed in a way that the model can understand. I simply defaulted to using JSON.&lt;/p&gt;

&lt;p&gt;The only complexity here is that I've used &lt;a href="https://www.npmjs.com/package/zod" rel="noopener noreferrer"&gt;&lt;code&gt;zod&lt;/code&gt;&lt;/a&gt; to describe the expected parameters and response schema of the tools. So the first thing we need to do is convert the &lt;code&gt;zod&lt;/code&gt; schema to JSON schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;zodToJsonSchema&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod-to-json-schema&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;P&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;R&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;infer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;P&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;infer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;R&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;P&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;R&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;describeTool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;zodToJsonSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;describeTool&lt;/code&gt; is a helper function that we will use to serialize tools to JSON.&lt;/p&gt;

&lt;p&gt;Now that we have a way to describe tools, we need to describe the conversation history.&lt;/p&gt;

&lt;h3&gt;
  
  
  Describing conversation history
&lt;/h3&gt;

&lt;p&gt;I am using the &lt;a href="https://www.npmjs.com/package/ai" rel="noopener noreferrer"&gt;&lt;code&gt;ai&lt;/code&gt;&lt;/a&gt; library's message format because most readers are familiar with it. However, this implementation does not depend on any specific framework.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;CoreMessage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;routeMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CoreMessage&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ReadonlyArray&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I like the &lt;code&gt;ai&lt;/code&gt; library message format because it captures every important aspect of the conversation, including the role of each participant, the content of the message, and the tools used. We need the conversation history to include tool invocations so that the model has context about which tools were already used in the conversation.&lt;/p&gt;
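&lt;p&gt;For illustration, a conversation history that records a prior tool invocation might look like this. This is a hedged sketch loosely following the &lt;code&gt;ai&lt;/code&gt; library's message shapes; check the library's documentation for the exact types:&lt;/p&gt;

```typescript
// Sketch: a user question, the assistant's tool call, and the tool result.
// Property names follow the ai library's CoreMessage format as I understand
// it; treat them as illustrative.
const messages = [
  { content: 'What is 2+2?', role: 'user' as const },
  {
    content: [
      {
        args: { a: 2, b: 2 },
        toolCallId: 'call_1',
        toolName: 'addNumbers',
        type: 'tool-call' as const,
      },
    ],
    role: 'assistant' as const,
  },
  {
    content: [
      {
        result: { sum: 4 },
        toolCallId: 'call_1',
        toolName: 'addNumbers',
        type: 'tool-result' as const,
      },
    ],
    role: 'tool' as const,
  },
];
```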

&lt;h3&gt;
  
  
  Writing the prompt
&lt;/h3&gt;

&lt;p&gt;In the end, the entire routing logic is expressed as a prompt.&lt;/p&gt;

&lt;p&gt;I've experimented with different prompts, and this is what I landed on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an assistant with access to the following tools:

[tools]

Tools are described in the following format:

* "description" describes what the tool does.
* "name" is the name of the tool.
* "parameters" is the JSON schema of the tool parameters (or null if the tool does not have parameters).

You are also given the following conversation history:

[messages]

The conversation history is a list of messages exchanged between the user and the assistant. It may also describe previous actions taken by the assistant.

Based on the conversation history, and the tools you have access to, propose a plan for how to answer the user's question.

The response should be a JSON object with "actions" property, which is an array of tools to use. Each tool is represented as an object with the following properties:

* "name": the name of the tool to use.
* "parameters": the parameters to pass to the tool (or null if the tool does not have parameters).

The same tool can be used multiple times in the plan.

If the conversation does not necessitate the use of tools, respond with an empty action plan, e.g.,

{
  actions: [],
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementing the routing logic
&lt;/h2&gt;

&lt;p&gt;Now that we have all the pieces in place, we can put them together.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[!NOTE]&lt;br&gt;
&lt;code&gt;quickPrompt&lt;/code&gt; is a simple utility function that I use to execute prompts with an expected response schema.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;quickPrompt&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./quickPrompt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;CoreMessage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;multiline&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;multiline-ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;zodToJsonSchema&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zod-to-json-schema&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SerializableZodSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;union&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;()]),&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;P&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;R&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;infer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;P&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;infer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;R&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;P&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;R&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;describeTool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;zodToJsonSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;routeMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CoreMessage&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ReadonlyArray&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Tool&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;AnyZodObject&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;quickPrompt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai@gpt-4o-mini&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;routeMessage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;multiline&lt;/span&gt;&lt;span class="s2"&gt;`
      You are an assistant with access to the following tools:

      &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;describeTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
      &lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;

      Tools are described in the following format:

      * "description" describes what the tool does.
      * "name" is the name of the tool.
      * "parameters" is the JSON schema of the tool parameters (or null if the tool does not have parameters).

      You are also given the following conversation history:

      &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;

      The conversation history is a list of messages exchanged between the user and the assistant. It may also describe previous actions taken by the assistant.

      Based on the conversation history, and the tools you have access to, propose a plan for how to answer the user's question.

      The response should be a JSON object with "actions" property, which is an array of tools to use. Each tool is represented as an object with the following properties:

      * "name": the name of the tool to use.
      * "parameters": the parameters to pass to the tool (or null if the tool does not have parameters).

      The same tool can be used multiple times in the plan.

      If the conversation does not necessitate the use of tools, respond with an empty action plan, e.g.,

      {
        actions: [],
      }
    `&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;zodSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
          &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SerializableZodSchema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nullable&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function captures the essence of the routing logic. It takes the conversation history and the tools available to the model, and returns a plan for how to answer the user's question.&lt;/p&gt;
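As an illustration of what ends up in the prompt, here is roughly what `describeTool` would produce for a hypothetical weather tool whose parameters schema is `z.object({ city: z.string() })`. The exact JSON schema shape is an approximation of what `zodToJsonSchema` emits:

```typescript
// Illustrative output of describeTool for a hypothetical tool.
// The parameters object approximates what zodToJsonSchema produces
// for z.object({ city: z.string() }); the exact output may differ.
const describedWeatherTool = {
  description: 'Fetches the current weather for a city.',
  name: 'getCurrentWeather',
  parameters: {
    type: 'object',
    properties: {
      city: { type: 'string' },
    },
    required: ['city'],
    additionalProperties: false,
  },
};

console.log(JSON.stringify(describedWeatherTool, null, 2));
```

This serialized form is what the model sees in place of `[tools]` in the prompt.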

&lt;h2&gt;
  
  
  Evaluating the routing logic
&lt;/h2&gt;

&lt;p&gt;The idea is that every time the user asks a question, we should use &lt;code&gt;routeMessage&lt;/code&gt; to determine if the question requires the use of tools, and if so, which tools to use.&lt;/p&gt;

&lt;p&gt;Inside &lt;a href="https://glama.ai/" rel="noopener noreferrer"&gt;Glama&lt;/a&gt;, I am using &lt;code&gt;routeMessage&lt;/code&gt; in the following way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// We may have multiple cycles of invocations.&lt;/span&gt;
&lt;span class="c1"&gt;// See the explanation after this code example.&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toolCycle&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_TOOL_CYCLES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;routeMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// If `routeMessage` returns an empty plan, it means that the conversation does not require the use of tools.&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// For each tool we use, we need to record the invocation and the result in the conversation history.&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolCallId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;randomUUID&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="nx"&gt;toolCallId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;toolName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool-call&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;invokeUserTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="nx"&gt;toolCallId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;toolName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool-result&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;toolCycle&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Now pass the messages history to the LLM which will use the recorded tool calls to generate a response.&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;streamAssistantResponse&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;abortController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;visitor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;visitor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above is mostly self-explanatory. The only tricky part is that we need to invoke &lt;code&gt;routeMessage&lt;/code&gt; in a loop because the model may need to use multiple tools to answer the user's question. For example, if the user asks 'What's the weather in New York?', the model may first use a tool to geocode the location and then use another tool to fetch the current weather at the resulting coordinates.&lt;/p&gt;
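To make the multi-cycle behavior concrete, here is roughly how the plans could evolve across cycles for that weather question. The tool names and parameters are made up for illustration:

```typescript
// Cycle 1: the router only has the user's question, so it plans the
// geocoding step first.
const firstPlan = {
  actions: [{ name: 'geocodeLocation', parameters: { query: 'New York' } }],
};

// Cycle 2: the geocoding result is now part of the conversation
// history, so the router can plan the weather lookup.
const secondPlan = {
  actions: [
    {
      name: 'getCurrentWeather',
      parameters: { latitude: 40.71, longitude: -74.01 },
    },
  ],
};

// Cycle 3: nothing is left to do, so the router returns an empty plan
// and the loop breaks.
const thirdPlan = { actions: [] };
```

Each cycle re-runs `routeMessage` with the updated history, so the second plan can use the coordinates produced by the first.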

&lt;p&gt;That is the entirety of the routing logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of self-implemented routing logic
&lt;/h2&gt;

&lt;p&gt;In the end, I prefer to implement the routing logic myself because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It allows me to use tools with models that do not natively support tools.&lt;/li&gt;
&lt;li&gt;I can expand the logic for how the tools are resolved. For example, I may want to load only a subset of tools based on the user's prompt, or prioritize tools based on their frequency of use.&lt;/li&gt;
&lt;li&gt;There is no ambiguity about the cost of using tools. To this day, I have no clue what OpenAI's pricing for tool use is.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>node</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Replacing GitHub Copilot with Local LLMs</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Fri, 11 Oct 2024 19:11:59 +0000</pubDate>
      <link>https://forem.com/punkpeye/replacing-github-copilot-with-local-llms-ce9</link>
      <guid>https://forem.com/punkpeye/replacing-github-copilot-with-local-llms-ce9</guid>
      <description>&lt;p&gt;As part of developing &lt;a href="https://glama.ai/" rel="noopener noreferrer"&gt;Glama&lt;/a&gt;, I try to stay at the cutting edge of everything AI, especially when it comes to LLM-enabled development. I've tried &lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt;, &lt;a href="https://supermaven.com/" rel="noopener noreferrer"&gt;Supermaven&lt;/a&gt;, and many other AI code completion tools. However, earlier this week I gave a try to locally hosted LLMs and &lt;em&gt;I am not coming back&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;These instructions assume that you are a macOS user.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The setup takes no more than a few minutes.&lt;/p&gt;

&lt;p&gt;Download and install &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What about &lt;a href="https://lmstudio.ai/" rel="noopener noreferrer"&gt;LM Studio&lt;/a&gt;? I saw a few posts debating one over the other. LM Studio has an intuitive UI; Ollama does not. However, my research led me to believe that &lt;a href="https://www.reddit.com/r/LocalLLaMA/comments/1c18hgj/why_ollama_faster_than_lmstudio/" rel="noopener noreferrer"&gt;Ollama is faster than LM Studio&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Install the model that you want to use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama pull starcoder2:3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've evaluated a few and landed on &lt;code&gt;starcoder2:3b&lt;/code&gt;. It provides a good balance of usefulness and inference speed.&lt;/p&gt;

&lt;p&gt;For context, the following table shows the speed of each model.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;tokens/second&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;starcoder2:3b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.1:8b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;codestral:22b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
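If you want comparable numbers on your own machine, one way (assuming Ollama is installed and the model has been pulled) is to run a prompt with Ollama's `--verbose` flag, which prints timing statistics, including the generation speed in tokens per second:

```shell
# Prints timing stats after the response, including "eval rate"
# (generation speed in tokens/s). Requires a local Ollama install
# and a previously pulled model.
ollama run starcoder2:3b --verbose "Write a hello world program in Python."
```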

&lt;p&gt;Finally, install &lt;a href="https://www.continue.dev/" rel="noopener noreferrer"&gt;continue.dev&lt;/a&gt;, a VSCode extension that enables tab completion (and chat) using local LLMs.&lt;/p&gt;

&lt;p&gt;Then update continue.dev settings to use the desired model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Starcoder2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"starcoder2:3b"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tabAutocompleteModel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Starcoder2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"starcoder2:3b"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart VSCode and you should be good to go.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Ensure that you've disabled GitHub Copilot and other overlapping VSCode extensions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Pros and Cons
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Offline Availability&lt;/strong&gt;: Work anywhere without relying on an internet connection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy&lt;/strong&gt;: Your code and prompts never leave your machine, ensuring maximum data privacy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customization&lt;/strong&gt;: Ability to fine-tune models to your specific needs or codebase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Subscription Costs&lt;/strong&gt;: Once set up, there are no ongoing fees unlike many cloud-based services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent Performance&lt;/strong&gt;: No latency issues due to poor internet connection or server load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open Source&lt;/strong&gt;: Many local LLMs are open-source, allowing for community improvements and transparency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial Setup Time&lt;/strong&gt;: Requires some time and technical knowledge to set up properly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware Requirements&lt;/strong&gt;: Local LLMs can be resource-intensive, requiring a reasonably powerful machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited Model Size&lt;/strong&gt;: Typically, local models are smaller than their cloud-based counterparts, which might affect performance for some tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual Updates&lt;/strong&gt;: You need to manually update models and tools to get the latest improvements.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;I was hesitant to adopt local LLMs because services like GitHub Copilot "just work." However, as I've been traveling the world, I found myself often regretting having to depend on an Internet connection for my auto completions. In that sense, switching to a local model has been a huge win for me. If Internet connectivity were not an issue, I think services like Supermaven would still be very appealing and worth the cost.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you are not familiar with &lt;a href="https://supermaven.com/" rel="noopener noreferrer"&gt;Supermaven&lt;/a&gt; and you are okay with depending on an Internet connection, it's worth checking out. Compared to GitHub Copilot, I found Supermaven's auto completion to be much more reliable and &lt;em&gt;much&lt;/em&gt; faster.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, if you are like me and want your code completion to work with or without an Internet connection, then this is definitely worth a try.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>development</category>
      <category>ai</category>
      <category>vscode</category>
    </item>
    <item>
      <title>No Single LLM Can Be Trusted in Isolation</title>
      <dc:creator>Frank Fiegel</dc:creator>
      <pubDate>Thu, 10 Oct 2024 22:40:27 +0000</pubDate>
      <link>https://forem.com/punkpeye/no-single-llm-can-be-trusted-in-isolation-1nh2</link>
      <guid>https://forem.com/punkpeye/no-single-llm-can-be-trusted-in-isolation-1nh2</guid>
      <description>&lt;p&gt;I started building &lt;a href="https://glama.ai" rel="noopener noreferrer"&gt;Glama&lt;/a&gt; after a simple observation: no single LLM can be trusted in isolation.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Awe to Skepticism
&lt;/h2&gt;

&lt;p&gt;Like many others, my first exposure to LLMs was through &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;'s &lt;a href="https://en.wikipedia.org/wiki/GPT-2" rel="noopener noreferrer"&gt;GPT-2 model&lt;/a&gt;. At first, I would compose a prompt, share it with the model, and usually accept its response as "most likely correct." However, like many others at the time, I still viewed the technology as a promise of what would be possible in the future, rather than as a trustworthy peer to consult.&lt;/p&gt;

&lt;p&gt;Later, in June 2020, when &lt;a href="https://en.wikipedia.org/wiki/GPT-3" rel="noopener noreferrer"&gt;GPT-3&lt;/a&gt; came out, wowed by many incredible demos, I began exploring what it is like to rely on LLMs for help with everyday tasks in my domain of expertise. This is where my trust in LLMs began to diminish...&lt;/p&gt;

&lt;h2&gt;
  
  
  Trust, but Verify
&lt;/h2&gt;

&lt;p&gt;There is a phenomenon known as the &lt;a href="https://www.epsilontheory.com/gell-mann-amnesia/" rel="noopener noreferrer"&gt;Gell-Mann Amnesia effect&lt;/a&gt;. The effect describes how an expert can spot numerous errors in an article about their field but then accept information on other subjects as accurate, forgetting the flaws they just identified. Being aware of the phenomenon, and observing the frequency of errors in the information I was receiving, I stopped trusting LLMs without validating their responses.&lt;/p&gt;

&lt;p&gt;Over time, more models started to appear, each one making more grandiose claims than the last. I started to experiment with all of them. No matter what the prompt was, I developed a habit of copy-pasting it across multiple models, such as those from OpenAI, Anthropic (Claude), and Google (Gemini). This change in behavior led me to a further insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A single LLM might be unreliable, but when multiple models independently reach the same conclusion, it boosts confidence in the accuracy of the information.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As a result, my trust in LLMs became proportional to the level of consensus achieved by consulting multiple models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of LLMs
&lt;/h2&gt;

&lt;p&gt;We've established that relying on any single LLM is dangerous. Based on my understanding of the technology, I believe this limitation is inherent to LLMs (rather than a question of model quality), for the following reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dataset Bias&lt;/strong&gt;: Each LLM is trained on a specific dataset, inheriting its biases and limitations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge Cutoff&lt;/strong&gt;: LLMs have a fixed knowledge cutoff date, lacking information on recent events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination&lt;/strong&gt;: LLMs can generate plausible-sounding but incorrect information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain Specificity&lt;/strong&gt;: Models excel in certain areas but underperform in others.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ethical Inconsistency&lt;/strong&gt;: Alignment techniques vary, leading to inconsistent handling of ethical queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overconfidence&lt;/strong&gt;: LLMs may present incorrect information with high confidence.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By leveraging multiple LLMs, we can mitigate these limitations. Different models can complement each other's strengths, allow users to cross-verify information, and provide a more balanced perspective. This approach, while not perfect, significantly improves the trustworthiness of LLMs.&lt;/p&gt;
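&lt;p&gt;The cross-verification idea can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it assumes you have already collected (and normalized) one answer per model, and it only shows the consensus check itself.&lt;/p&gt;

```python
from collections import Counter

def consensus(answers, threshold=0.5):
    """Return the majority answer if enough models agree, else None.

    answers: one (normalized) answer string per model consulted.
    threshold: fraction of models that must agree for the answer
    to be treated as trustworthy.
    """
    if not answers:
        return None
    # Find the most common answer and how many models gave it.
    answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) > threshold:
        return answer
    return None

# Three models consulted; two independently agree.
print(consensus(["Paris", "Paris", "Lyon"]))  # Paris
# No majority: the result should be verified by a human.
print(consensus(["Paris", "Lyon", "Marseille"]))  # None
```

&lt;p&gt;Real answers rarely match verbatim, so in practice "agreement" would need semantic comparison (e.g. embedding similarity or a judge model) rather than string equality, but the voting logic stays the same.&lt;/p&gt;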

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;br&gt;
In addition to what's discussed in this article, I also want to draw attention to the emergence of 'AI services' (it's no longer accurate to call them just LLMs) that are capable of reasoning. These services combine techniques such as Dynamic Chain-of-Thought (CoT), Reflection, and Verbal Reinforcement Learning to provide responses that aim to offer a higher degree of trust. There is a &lt;a href="https://medium.com/@harishhacker3010/can-we-make-any-smaller-opensource-ai-models-smarter-than-human-1ea507e644a0" rel="noopener noreferrer"&gt;great article&lt;/a&gt; that goes into detail about what these techniques are and how they work. We are actively working on bringing these capabilities to Glama.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Glama: Streamlining Multi-Model Interactions
&lt;/h2&gt;

&lt;p&gt;Recognizing the limitations of single-model reliance, I developed &lt;a href="https://glama.ai/" rel="noopener noreferrer"&gt;Glama&lt;/a&gt; as a solution to streamline the process of gaining perspectives from multiple LLMs. Glama provides a unified platform where users can interact with various AI models simultaneously, effectively creating a panel of AI advisors.&lt;/p&gt;

&lt;p&gt;Key features of Glama include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Model Querying&lt;/strong&gt;: Simultaneously consult multiple LLMs, including the latest from Google, OpenAI, and Anthropic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enterprise-Grade Security&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data remains under your control, never used for model training.&lt;/li&gt;
&lt;li&gt;End-to-end encryption (AES 256, TLS 1.2+) for data in transit and at rest.&lt;/li&gt;
&lt;li&gt;SOC 2 compliance, meeting stringent security standards.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Seamless Integration&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Admin console for easy team management, including SSO and domain verification.&lt;/li&gt;
&lt;li&gt;Collaborative features like shared chat templates for streamlined workflows.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Comparative Analysis&lt;/strong&gt;: Easily compare responses side-by-side to identify consistencies and discrepancies across models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customizable Model Selection&lt;/strong&gt;: Choose which LLMs to consult based on your specific needs and security requirements.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By facilitating secure, efficient access to diverse AI perspectives, Glama empowers users to make more informed decisions, leveraging the strengths of multiple models while mitigating individual weaknesses – all within a robust, enterprise-ready environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In today's AI landscape, relying on a single LLM is akin to seeking advice from just one expert – potentially valuable, but inherently limited. Glama embodies the principle that diversity in AI perspectives leads to more robust and reliable outcomes. By streamlining access to multiple LLMs, Glama not only saves time but also enhances the quality of AI-assisted decision-making.&lt;/p&gt;

&lt;p&gt;As we continue to navigate the evolving world of AI, tools like Glama will play a crucial role in helping users harness the collective intelligence of multiple models...&lt;/p&gt;

&lt;p&gt;There's no one AI to rule them all – but with Glama, you can leverage the power of many.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>sideprojects</category>
    </item>
  </channel>
</rss>
