<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Binary Ink</title>
    <description>The latest articles on Forem by Binary Ink (@david_shawn_e308bed98c45b).</description>
    <link>https://forem.com/david_shawn_e308bed98c45b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2248370%2F67d81f95-de64-48d3-966d-e006e93c4fbe.jpg</url>
      <title>Forem: Binary Ink</title>
      <link>https://forem.com/david_shawn_e308bed98c45b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/david_shawn_e308bed98c45b"/>
    <language>en</language>
    <item>
      <title>I Tested Gemma 4 on My Laptop and Turned It Into a Free Intelligence Layer for My AI Apps</title>
      <dc:creator>Binary Ink</dc:creator>
      <pubDate>Fri, 03 Apr 2026 16:51:28 +0000</pubDate>
      <link>https://forem.com/david_shawn_e308bed98c45b/i-tested-gemma-4-on-my-laptop-and-turned-it-into-a-free-intelligence-layer-for-my-ai-apps-8dh</link>
      <guid>https://forem.com/david_shawn_e308bed98c45b/i-tested-gemma-4-on-my-laptop-and-turned-it-into-a-free-intelligence-layer-for-my-ai-apps-8dh</guid>
      <description>&lt;p&gt;&lt;em&gt;How a $0 local model replaced $10/day in API calls across four production modules&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I've been building MasterCLI — a multi-module, AI-native desktop platform built with Go, React, and PostgreSQL. It includes a RAG knowledge base, a multi-agent discussion forum, and an orchestration hub (Nexus).&lt;/p&gt;

&lt;p&gt;All of these modules were calling cloud APIs (GPT-4o-mini, Claude) for tasks like classifying user queries, extracting structured data from documents, and preprocessing messages. That's roughly &lt;strong&gt;$10/day in API costs&lt;/strong&gt; just for classification and extraction — tasks that don't need frontier-model intelligence.&lt;/p&gt;

&lt;p&gt;Then Google released &lt;strong&gt;Gemma 4&lt;/strong&gt; (8B) and I decided to test it locally. Here's what I found, and how I integrated it into four production modules in one afternoon.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup: Nothing Fancy
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Laptop&lt;/strong&gt;: Regular gaming laptop with an RTX 3070 Ti (8GB VRAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt;: Gemma 4 8B, Q4_K_M quantization (9.6GB on disk)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime&lt;/strong&gt;: Ollama v0.20.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS&lt;/strong&gt;: Windows 11&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model doesn't even fit entirely in VRAM — it partially offloads to system RAM. This is a real-world test, not a cloud GPU benchmark.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4
ollama list
&lt;span class="c"&gt;# gemma4:latest  9.6 GB  Q4_K_M&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Benchmark: Surprises Everywhere
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Speed: Consistent ~25 tok/s
&lt;/h3&gt;

&lt;p&gt;Across all tests, generation speed held steady at roughly 20-27 tok/s:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple Q&amp;amp;A&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.6s&lt;/td&gt;
&lt;td&gt;19.8 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Go code generation&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;td&gt;25.7s&lt;/td&gt;
&lt;td&gt;23.4 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chinese JSON extraction&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;18.5s&lt;/td&gt;
&lt;td&gt;27.1 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent classification&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;0.4s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;25.6 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;1.3s&lt;/td&gt;
&lt;td&gt;27.1 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Prompt processing was much faster: 120-850 tok/s depending on batch size.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discovery #1: It's a Thinking Model
&lt;/h3&gt;

&lt;p&gt;This was the biggest surprise. When I first ran the tests, responses appeared empty. After debugging the streaming output, I discovered Gemma 4 is a &lt;strong&gt;thinking model&lt;/strong&gt; — like DeepSeek-R1 or o1.&lt;/p&gt;

&lt;p&gt;For complex questions, the response looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"thinking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Here's a thinking process..."&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"thinking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;" to arrive at..."&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;many&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;thinking&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tokens&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"The three main patterns are..."&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model spends tokens on chain-of-thought reasoning in the &lt;code&gt;thinking&lt;/code&gt; field before producing the final answer in &lt;code&gt;content&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The critical parameter&lt;/strong&gt;: &lt;code&gt;"think": false&lt;/code&gt; disables this behavior:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;think=true&lt;/th&gt;
&lt;th&gt;think=false&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;6.9s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.7x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON extraction&lt;/td&gt;
&lt;td&gt;19.4s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.3s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.5x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;26.7s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13.3s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For structured extraction and classification, &lt;code&gt;think=false&lt;/code&gt; is essential. You get the same quality output without the reasoning overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discovery #2: Ollama API Quirks
&lt;/h3&gt;

&lt;p&gt;Two gotchas that cost me an hour of debugging:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;/api/generate&lt;/code&gt; is broken&lt;/strong&gt; for Gemma 4 — the &lt;code&gt;response&lt;/code&gt; field is always empty (tokens are generated but not decoded to text). You &lt;strong&gt;must&lt;/strong&gt; use &lt;code&gt;/api/chat&lt;/code&gt; instead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool calling needs &lt;code&gt;num_predict &amp;gt;= 2048&lt;/code&gt;&lt;/strong&gt; — with smaller budgets, thinking tokens consume the entire allocation and tool calls never emit. With enough headroom, the model is smart enough to skip thinking and call tools directly (34 tokens, 1.3s).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Discovery #3: Tool Calling is Excellent
&lt;/h3&gt;

&lt;p&gt;Given this tool definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_contracts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"min_budget"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"enum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"IT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"construction"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"services"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the prompt: &lt;em&gt;"Find IT contracts over 5M CNY"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Gemma 4 correctly inferred:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_contracts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"min_budget"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IT contracts"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;34 tokens, 1.3 seconds.&lt;/strong&gt; No thinking needed. This makes it viable for real-time tool routing.&lt;/p&gt;
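
&lt;p&gt;On the Go side, routing that result is straightforward. Here's a minimal sketch of how I map the returned JSON into a struct and dispatch it. The &lt;code&gt;dispatchToolCall&lt;/code&gt; helper and the &lt;code&gt;searchContracts&lt;/code&gt; callback are illustrative stand-ins, not the real Nexus code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package tools

import (
    "encoding/json"
    "fmt"
)

// toolCall mirrors the JSON Gemma 4 emitted above; arguments stay raw
// until we know which tool was requested.
type toolCall struct {
    Name      string          `json:"name"`
    Arguments json.RawMessage `json:"arguments"`
}

type searchContractsArgs struct {
    Query     string  `json:"query"`
    MinBudget float64 `json:"min_budget"`
    Category  string  `json:"category"`
}

// dispatchToolCall routes one parsed call to its handler. The searchContracts
// callback is a stand-in for the real query code.
func dispatchToolCall(tc toolCall, searchContracts func(searchContractsArgs) error) error {
    switch tc.Name {
    case "search_contracts":
        var args searchContractsArgs
        if err := json.Unmarshal(tc.Arguments, &amp;amp;args); err != nil {
            return fmt.Errorf("search_contracts: bad arguments: %w", err)
        }
        // here: {Query:"IT contracts", MinBudget:5000000, Category:"IT"}
        return searchContracts(args)
    default:
        return fmt.Errorf("unknown tool %q", tc.Name)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;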

&lt;h2&gt;
  
  
  The Architecture: Tiered Intelligence
&lt;/h2&gt;

&lt;p&gt;Based on the benchmarks, I designed a two-tier system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Request
    |
    v
+------------------+
|  Gemma 4 (local) |  &amp;lt;-- Fast classification, extraction, routing
|  think=false     |      Latency: &amp;lt;1-4s, Cost: $0
|  ~25 tok/s       |
+--------+---------+
         |
    +----+----+
    | Simple  | --&amp;gt; Return directly (classification, extraction, tags)
    | Complex | --&amp;gt; Escalate to cloud
    +----+----+
         v
+------------------+
| Claude/GPT (API) |  &amp;lt;-- Complex reasoning, long-form generation
| High quality     |      Latency: 2-10s, Pay per token
+------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;most "intelligence" tasks in a multi-module app are simple classification and extraction&lt;/strong&gt; — exactly what a local 8B model excels at.&lt;/p&gt;
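
&lt;p&gt;In code, the routing decision is tiny. This is a simplified sketch of that rule; the task labels and the token threshold are illustrative, and each module applies its own variant:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package router

// Tier picks which backend answers a request.
type Tier int

const (
    TierLocal Tier = iota // Gemma 4, think=false, ~25 tok/s, $0
    TierCloud             // Claude / GPT for complex reasoning
)

// classifyTier keeps cheap, structured work local and escalates the rest.
func classifyTier(task string, estOutputTokens int) Tier {
    switch {
    case estOutputTokens &amp;gt; 512: // long-form output is too slow at ~25 tok/s
        return TierCloud
    case task == "classification", task == "extraction", task == "routing", task == "tagging":
        return TierLocal
    default:
        return TierCloud
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;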

&lt;h2&gt;
  
  
  Four Integrations in One Afternoon
&lt;/h2&gt;

&lt;h3&gt;
  
  
  P1: Master RAG — Query Classification Middleware
&lt;/h3&gt;

&lt;p&gt;The RAG knowledge base has 80+ domains and 7 namespaces. Previously, users had to manually specify &lt;code&gt;domains: ["ai-ml"]&lt;/code&gt; in their searches.&lt;/p&gt;

&lt;p&gt;Now Gemma 4 auto-classifies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;DB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ClassifyQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;QueryClassification&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QuickClassify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classifyPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// Returns: {domains: ["ai-ml"], namespaces: ["code"], search_mode: "hybrid"}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: &amp;lt;1s to auto-detect domain/namespace. Users just type their query naturally.&lt;/p&gt;

&lt;h3&gt;
  
  
  P2: Forum — Message Preprocessing
&lt;/h3&gt;

&lt;p&gt;The multi-agent discussion forum runs 3+1 AI agents (Claude, Codex, Gemini + coordinator). Each message was going to the cloud for analysis.&lt;/p&gt;

&lt;p&gt;Now messages are preprocessed locally — &lt;strong&gt;in a goroutine so it doesn't block the discussion&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;handleSpeak&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agentID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;preprocessMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agentID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"forum:post:meta"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="c"&gt;// ... save post and advance turn (not blocked) ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Intent classification, sentiment analysis, and topic extraction — all in &amp;lt;1s, invisible to the discussion flow.&lt;/p&gt;

&lt;h3&gt;
  
  
  P3: Nexus — Tool Routing
&lt;/h3&gt;

&lt;p&gt;Nexus orchestrates multiple AI agent terminals. When creating a new agent session, the system now classifies the task intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "What design patterns are used in the codebase?"
Gemma4: module=code, confidence=0.87, hint=grep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is exposed as both an internal routing signal and a standalone MCP tool (&lt;code&gt;classify_intent&lt;/code&gt;).&lt;/p&gt;
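
&lt;p&gt;Both entry points share one function. Here's a sketch of that shared piece, assuming a small &lt;code&gt;classifier&lt;/code&gt; interface over the Ollama client; the struct fields mirror the &lt;code&gt;module&lt;/code&gt;, &lt;code&gt;confidence&lt;/code&gt;, and &lt;code&gt;hint&lt;/code&gt; values above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package nexus

import (
    "context"
    "encoding/json"
    "fmt"
)

// IntentResult is what both the session router and the classify_intent
// MCP tool hand back.
type IntentResult struct {
    Module     string  `json:"module"`     // e.g. "code"
    Confidence float64 `json:"confidence"` // e.g. 0.87
    Hint       string  `json:"hint"`       // e.g. "grep"
}

// classifier is satisfied by the lightweight Ollama client shown later.
type classifier interface {
    QuickClassify(ctx context.Context, system, input string) (string, error)
}

// ClassifyIntent is the shared entry point: the router calls it directly,
// and the classify_intent MCP tool is a thin wrapper around it.
func ClassifyIntent(ctx context.Context, c classifier, userRequest string) (*IntentResult, error) {
    const prompt = `Classify the request. Reply with JSON only: {"module":"","confidence":0,"hint":""}`
    raw, err := c.QuickClassify(ctx, prompt, userRequest)
    if err != nil {
        return nil, fmt.Errorf("classify_intent: %w", err)
    }
    var res IntentResult
    if err := json.Unmarshal([]byte(raw), &amp;amp;res); err != nil {
        return nil, fmt.Errorf("classify_intent: bad JSON: %w", err)
    }
    return &amp;amp;res, nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;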

&lt;h3&gt;
  
  
  Bonus: The Duck Secretary Gets a Brain
&lt;/h3&gt;

&lt;p&gt;MasterCLI's Dashboard has a mascot — a yellow rubber duck secretary that scans the project state and generates daily briefings. Before Gemma 4, it produced mechanical summaries like &lt;code&gt;"28 task(s) ready, 10 active goal(s)"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now it generates actual insights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before: "28 task(s) ready, 10 active goal(s)"

The Browser module currently has the largest backlog, with 11 pending tasks.
         B-13, B-14, and B-15 are ready to begin.
         Prioritizing this batch today would also help create a more stable foundation for Dashboard and Nexus."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key was &lt;strong&gt;prompt compression&lt;/strong&gt;: a long prompt (180 chars, 5 requirements) took 19.7s. A one-line prompt (50 chars) with compact data produced equally good output in &lt;strong&gt;4.3s&lt;/strong&gt;. The duck is now genuinely useful.&lt;/p&gt;
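
&lt;p&gt;To give a feel for the difference (the prompt wording here is reconstructed for the post; only the timings are from my tests):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package dashboard

import (
    "fmt"
    "strings"
)

// The long prompt (19.7s) spelled out every requirement; the compact one
// (4.3s) trusts the model. Wording is illustrative, timings are from above.
const verbosePrompt = `You are the Dashboard secretary. Review the project state,
find the module with the largest backlog, suggest which tasks to start today,
mention dependencies between modules, keep a friendly tone, stay under 80 words.`

const compactPrompt = `Summarize this project state in 2-3 sentences, suggest next tasks.`

// compactState gives the model one line per module instead of raw task dumps.
func compactState(pendingByModule map[string]int) string {
    var b strings.Builder
    for module, n := range pendingByModule {
        fmt.Fprintf(&amp;amp;b, "%s: %d pending\n", module, n)
    }
    return b.String()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;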

&lt;h2&gt;
  
  
  The Go Client: 150 Lines
&lt;/h2&gt;

&lt;p&gt;Each module gets a lightweight Ollama chat client — the same pattern, ~150 lines of Go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;OllamaChat&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;   &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="c"&gt;// "http://localhost:11434"&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;      &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="c"&gt;// "gemma4"&lt;/span&gt;
    &lt;span class="n"&gt;httpClient&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;OllamaChat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;QuickClassify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// POST /api/chat with stream=true, think=false, num_predict=128&lt;/span&gt;
    &lt;span class="c"&gt;// Concatenate streaming chunks, return content&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key configuration rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Always use &lt;code&gt;/api/chat&lt;/code&gt;&lt;/strong&gt;, never &lt;code&gt;/api/generate&lt;/code&gt; (Gemma 4 bug)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;think: false&lt;/code&gt;&lt;/strong&gt; for classification/extraction (7x faster)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;num_predict: 2048&lt;/code&gt;&lt;/strong&gt; for tool calling (needs headroom)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming mode&lt;/strong&gt; to capture both &lt;code&gt;thinking&lt;/code&gt; and &lt;code&gt;content&lt;/code&gt; fields&lt;/li&gt;
&lt;/ul&gt;
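
&lt;p&gt;Putting those rules together, here's a sketch of what the &lt;code&gt;QuickClassify&lt;/code&gt; body looks like, assuming the &lt;code&gt;OllamaChat&lt;/code&gt; struct above. It's trimmed for the post (no retries, minimal error handling), and the chunk struct only keeps the fields this helper needs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package ollamaclient

import (
    "bufio"
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "strings"
)

// QuickClassify posts to /api/chat with think=false and concatenates the
// streamed content chunks into one string.
func (o *OllamaChat) QuickClassify(ctx context.Context, system, input string) (string, error) {
    payload, _ := json.Marshal(map[string]any{
        "model":   o.model,
        "stream":  true,
        "think":   false, // classification never needs chain-of-thought
        "options": map[string]any{"num_predict": 128},
        "messages": []map[string]string{
            {"role": "system", "content": system},
            {"role": "user", "content": input},
        },
    })
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, o.endpoint+"/api/chat", bytes.NewReader(payload))
    if err != nil {
        return "", err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := o.httpClient.Do(req)
    if err != nil {
        return "", fmt.Errorf("ollama: %w", err)
    }
    defer resp.Body.Close()

    // The stream is JSON lines; "content" carries the answer, "thinking" is
    // ignored here because think=false.
    var sb strings.Builder
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        var chunk struct {
            Message struct {
                Content string `json:"content"`
            } `json:"message"`
            Done bool `json:"done"`
        }
        if err := json.Unmarshal(scanner.Bytes(), &amp;amp;chunk); err != nil {
            continue // skip malformed lines
        }
        sb.WriteString(chunk.Message.Content)
        if chunk.Done {
            break
        }
    }
    return strings.TrimSpace(sb.String()), scanner.Err()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;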

&lt;h2&gt;
  
  
  Cost Analysis
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before (Cloud API)&lt;/th&gt;
&lt;th&gt;After (Local Gemma 4)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAG classification&lt;/td&gt;
&lt;td&gt;~$7/day&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forum preprocessing&lt;/td&gt;
&lt;td&gt;~$8/day&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nexus routing&lt;/td&gt;
&lt;td&gt;~$1/day&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duck Secretary insight&lt;/td&gt;
&lt;td&gt;~$1/day&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$17/day&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0 + electricity&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Annual savings&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$6,200&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tradeoff: ~25 tok/s means you can't use it for long-form generation. But for classification, extraction, and routing? It's free and fast enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gemma 4 is a thinking model&lt;/strong&gt; — if you don't know this, your responses look empty. Use &lt;code&gt;think: false&lt;/code&gt; for production workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;8B models are production-ready for structured tasks&lt;/strong&gt; — classification, extraction, tool calling. Don't overpay for intelligence you don't need.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Ollama API has model-specific quirks&lt;/strong&gt; — always test with your specific model. Gemma 4 breaks the generate endpoint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid architecture wins&lt;/strong&gt; — local models for fast/cheap tasks, cloud for complex reasoning. The routing logic itself can run on the local model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Go + Ollama streaming is straightforward&lt;/strong&gt; — the &lt;code&gt;/api/chat&lt;/code&gt; streaming protocol is simple JSON lines. No SDK needed.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;p&gt;The hybrid architecture in this article — local models for routing, cloud models for reasoning — is one of the patterns I cover in depth in my two books:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://shadowshao.gumroad.com/l/production-mcp-servers-go" rel="noopener noreferrer"&gt;"Production MCP Servers with Go"&lt;/a&gt;&lt;/strong&gt; covers the full lifecycle of building MCP servers like the ones powering Master RAG: tool calling, resource management, authentication, testing, and deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://shadowshao.gumroad.com/l/building-ai-coding-agents" rel="noopener noreferrer"&gt;"Building AI Coding Agents"&lt;/a&gt;&lt;/strong&gt; goes wider — agent loops, context management, safety models, eval frameworks, and multi-agent orchestration. The model routing pattern from Chapter 6 is exactly what this article implements with Gemma 4.&lt;/p&gt;

&lt;p&gt;Both are based on the same production codebase described here.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you tested Gemma 4 locally? What's your experience with hybrid local/cloud architectures? I'd love to hear about your setup in the comments.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: #gemma4 #ollama #golang #ai #mcp #localllm #devtools&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Series&lt;/strong&gt;: Building AI-Native Applications with Go&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cover image description&lt;/strong&gt;: A laptop with terminal showing Ollama running Gemma 4, with performance metrics overlay showing ~25 tok/s generation speed.&lt;/p&gt;

</description>
      <category>gemma4</category>
      <category>ollama</category>
      <category>go</category>
      <category>ai</category>
    </item>
    <item>
      <title>I wrote the first book on building production MCP servers with Go</title>
      <dc:creator>Binary Ink</dc:creator>
      <pubDate>Thu, 02 Apr 2026 03:55:40 +0000</pubDate>
      <link>https://forem.com/david_shawn_e308bed98c45b/i-wrote-the-first-book-on-building-production-mcp-servers-with-go-14b9</link>
      <guid>https://forem.com/david_shawn_e308bed98c45b/i-wrote-the-first-book-on-building-production-mcp-servers-with-go-14b9</guid>
      <description>&lt;p&gt;Most MCP tutorials use Python. That's fine for prototypes. But when you need a server that handles thousands of concurrent connections on 128 MB of RAM, starts in 50ms, and deploys as a single binary — you need Go.&lt;/p&gt;

&lt;p&gt;I spent the last few months building MCP servers in Go for production systems. Eight different servers, 4,000+ lines of production code, handling real workloads across project management, browser automation, knowledge bases, and multi-agent orchestration.&lt;/p&gt;

&lt;p&gt;Then I realized: &lt;strong&gt;there is no book on this.&lt;/strong&gt; Not one. The MCP docs cover the protocol. There are Python quickstarts. TypeScript examples. But nothing that shows you how to build a production Go MCP server with authentication, database integration, deployment, and billing.&lt;/p&gt;

&lt;p&gt;So I wrote one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Go for MCP Servers?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;TypeScript&lt;/th&gt;
&lt;th&gt;Go&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~50-100 MB&lt;/td&gt;
&lt;td&gt;~30-60 MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~5-15 MB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Startup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1-3s&lt;/td&gt;
&lt;td&gt;0.5-1s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;50ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concurrency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;asyncio&lt;/td&gt;
&lt;td&gt;event loop&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;goroutines&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;venv + pip&lt;/td&gt;
&lt;td&gt;node_modules&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;single binary&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cross-compile&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;painful&lt;/td&gt;
&lt;td&gt;painful&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;GOOS=linux go build&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The numbers matter when you run multiple MCP servers. A Go MCP server uses 10x less memory than Python, starts 50x faster, and deploys as a single file with zero dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned from production
&lt;/h2&gt;

&lt;p&gt;Here are patterns that aren't in any tutorial:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Serve SSE and Streamable HTTP on the same port
&lt;/h3&gt;

&lt;p&gt;Different AI clients use different transports. Claude uses SSE. Codex uses Streamable HTTP. Don't make users configure which one — serve both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;mux&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewServeMux&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/sse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sseHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c"&gt;// Claude, Gemini&lt;/span&gt;
&lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;streamHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c"&gt;// Codex, newer clients&lt;/span&gt;
&lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/health"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;healthCheck&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One port. Every client works.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Business errors vs. system errors
&lt;/h3&gt;

&lt;p&gt;This is the #1 mistake in MCP server code. Tool handlers return two things: a result and an error. They mean different things:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Business error — the AI sees this and can retry&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewToolResultError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"user not found"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;

&lt;span class="c"&gt;// System error — crashes the request (database down, etc.)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"connection lost"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Return &lt;code&gt;NewToolResultError&lt;/code&gt; for "that didn't work, try something else." Return Go &lt;code&gt;error&lt;/code&gt; for "something is fundamentally broken." The AI handles the first kind gracefully. The second kind may close the connection.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Bearer token auth with browser fallback
&lt;/h3&gt;

&lt;p&gt;The browser &lt;code&gt;EventSource&lt;/code&gt; API cannot set custom headers. Period. So when a browser-based MCP client connects via SSE, the token goes in the URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;authMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandlerFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c"&gt;// Browser fallback — EventSource can't set headers&lt;/span&gt;
            &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Bearer "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"token"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;"Bearer "&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;expectedToken&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Unauthorized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;401&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, the token appears in server logs. Use HTTPS, rotate tokens, and strip query params from access logs.&lt;/p&gt;
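
&lt;p&gt;That last point is easy to forget. Here's a minimal sketch of what stripping the token from access logs looks like in practice: log a sanitized copy of the URL, never the raw request line.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package mcpserver

import (
    "log"
    "net/http"
)

// loggingMiddleware logs requests with the token query parameter redacted.
// The handler chain still sees the original URL.
func loggingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        u := *r.URL // copy before rewriting the query
        q := u.Query()
        if q.Has("token") {
            q.Set("token", "REDACTED")
            u.RawQuery = q.Encode()
        }
        log.Printf("%s %s", r.Method, u.String())
        next.ServeHTTP(w, r)
    })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;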

&lt;h3&gt;
  
  
  4. Session cleanup via MCP hooks
&lt;/h3&gt;

&lt;p&gt;MCP clients hold long-lived connections. When they disconnect (laptop closes, network drops), you need to clean up. The mcp-go library fires lifecycle hooks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;hooks&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddOnUnregisterSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sessionID&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SessionID&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;agentID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sessionAgents&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoadAndDelete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sessionID&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;disconnectAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agentID&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you skip this, you leak memory. Every disconnected session stays in your maps forever.&lt;/p&gt;
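
&lt;p&gt;For reference, &lt;code&gt;sessionAgents&lt;/code&gt; here is a &lt;code&gt;sync.Map&lt;/code&gt;: store the mapping when the agent connects, and &lt;code&gt;LoadAndDelete&lt;/code&gt; it in the hook above. A minimal sketch of the connect side (the &lt;code&gt;bindAgent&lt;/code&gt; name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package mcpserver

import "sync"

// sessionAgents maps MCP session IDs to agent IDs so the unregister hook
// above can clean up exactly what this session registered.
var sessionAgents sync.Map

// bindAgent is called from the tool handler that registers an agent
// for the current session.
func bindAgent(sessionID, agentID string) {
    sessionAgents.Store(sessionID, agentID)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;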

&lt;h3&gt;
  
  
  5. Symlinks break your path validation
&lt;/h3&gt;

&lt;p&gt;Most file-handling tools do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// LOOKS safe but ISN'T&lt;/span&gt;
&lt;span class="n"&gt;abs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userPath&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An attacker creates a symlink &lt;code&gt;workspace/data → /etc&lt;/code&gt; and requests &lt;code&gt;data/shadow&lt;/code&gt;. The prefix check passes. The symlink resolves to &lt;code&gt;/etc/shadow&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Fix: call &lt;code&gt;filepath.EvalSymlinks&lt;/code&gt; before the prefix check.&lt;/p&gt;
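
&lt;p&gt;A sketch of the corrected check. The &lt;code&gt;resolveInRoot&lt;/code&gt; name is mine; note that &lt;code&gt;EvalSymlinks&lt;/code&gt; also rejects paths that don't exist yet:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package mcpserver

import (
    "fmt"
    "path/filepath"
    "strings"
)

// resolveInRoot resolves symlinks first, then does a separator-aware
// prefix check against the (also resolved) root.
func resolveInRoot(root, userPath string) (string, error) {
    realRoot, err := filepath.EvalSymlinks(root)
    if err != nil {
        return "", err
    }
    resolved, err := filepath.EvalSymlinks(filepath.Join(realRoot, userPath))
    if err != nil {
        return "", err // also fails for paths that don't exist yet
    }
    if resolved != realRoot &amp;amp;&amp;amp;
        !strings.HasPrefix(resolved, realRoot+string(filepath.Separator)) {
        return "", fmt.Errorf("path escapes workspace: %s", userPath)
    }
    return resolved, nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;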

&lt;h2&gt;
  
  
  What the book covers
&lt;/h2&gt;

&lt;p&gt;12 chapters, 110+ pages, every example from production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;MCP Protocol&lt;/strong&gt; — architecture, transports, JSON-RPC flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quick Start&lt;/strong&gt; — a running server in 5 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server Scaffold&lt;/strong&gt; — dual transport, health checks, graceful shutdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Development&lt;/strong&gt; — schemas, validation, rate limiting, long-running ops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources &amp;amp; Prompts&lt;/strong&gt; — fixed/template resources, context-bundling prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication &amp;amp; Security&lt;/strong&gt; — bearer tokens, symlink defense, risk classification, API keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Integration&lt;/strong&gt; — pgxpool, embedded migrations, pgvector semantic search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing&lt;/strong&gt; — unit, integration, testcontainers, CI/CD&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt; — multi-stage Docker, Compose, Caddy HTTPS, Prometheus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production Patterns&lt;/strong&gt; — sessions, events, multi-tenant, circuit breakers, "mistakes I made"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monetization&lt;/strong&gt; — Stripe billing, pricing models, distribution, case study ($10K MRR in 6 weeks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Appendix&lt;/strong&gt; — client compatibility matrix, quick reference, LLM uncertainty handling&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The MCP economy is wide open
&lt;/h2&gt;

&lt;p&gt;17,000+ MCP servers exist. Less than 5% are monetized. The SDK gets 97 million monthly downloads. This is the mobile app store in 2009 — massive developer activity, almost no established business models.&lt;/p&gt;

&lt;p&gt;The book's final chapter covers how to monetize: freemium, usage-based, hybrid pricing, Stripe metering integration, and distribution across MCP marketplaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get the book
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://shadowshao.gumroad.com/l/production-mcp-servers-go" rel="noopener noreferrer"&gt;Production MCP Servers with Go&lt;/a&gt;&lt;/strong&gt; — $39 on Gumroad.&lt;/p&gt;

&lt;p&gt;PDF + EPUB. 110 pages. 12 chapters. All code from production systems.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions? Drop them in the comments. I'll answer everything about building MCP servers with Go.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>mcp</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
