<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Binary Ink</title>
    <description>The latest articles on Forem by Binary Ink (@david_shawn_e308bed98c45b).</description>
    <link>https://forem.com/david_shawn_e308bed98c45b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2248370%2F67d81f95-de64-48d3-966d-e006e93c4fbe.jpg</url>
      <title>Forem: Binary Ink</title>
      <link>https://forem.com/david_shawn_e308bed98c45b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/david_shawn_e308bed98c45b"/>
    <language>en</language>
    <item>
      <title>I Tested Gemma 4 on My Laptop and Turned It Into a Free Intelligence Layer for My AI Apps</title>
      <dc:creator>Binary Ink</dc:creator>
      <pubDate>Fri, 03 Apr 2026 16:51:28 +0000</pubDate>
      <link>https://forem.com/david_shawn_e308bed98c45b/i-tested-gemma-4-on-my-laptop-and-turned-it-into-a-free-intelligence-layer-for-my-ai-apps-8dh</link>
      <guid>https://forem.com/david_shawn_e308bed98c45b/i-tested-gemma-4-on-my-laptop-and-turned-it-into-a-free-intelligence-layer-for-my-ai-apps-8dh</guid>
      <description>&lt;p&gt;&lt;em&gt;How a $0 local model replaced $10/day in API calls across four production modules&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I've been building MasterCLI — a multi-module, AI-native desktop platform built with Go, React, and PostgreSQL. It includes a RAG knowledge base, a multi-agent discussion forum, and an orchestration hub (Nexus).&lt;/p&gt;

&lt;p&gt;All of these modules were calling cloud APIs (GPT-4o-mini, Claude) for tasks like classifying user queries, extracting structured data from documents, and preprocessing messages. That's roughly &lt;strong&gt;$10/day in API costs&lt;/strong&gt; just for classification and extraction — tasks that don't need frontier-model intelligence.&lt;/p&gt;

&lt;p&gt;Then Google released &lt;strong&gt;Gemma 4&lt;/strong&gt; (8B) and I decided to test it locally. Here's what I found, and how I integrated it into four production modules in one afternoon.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup: Nothing Fancy
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Laptop&lt;/strong&gt;: Regular gaming laptop with an RTX 3070 Ti (8GB VRAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt;: Gemma 4 8B, Q4_K_M quantization (9.6GB on disk)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime&lt;/strong&gt;: Ollama v0.20.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS&lt;/strong&gt;: Windows 11&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model doesn't even fit entirely in VRAM — it partially offloads to system RAM. This is a real-world test, not a cloud GPU benchmark.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4
ollama list
&lt;span class="c"&gt;# gemma4:latest  9.6 GB  Q4_K_M&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Benchmark: Surprises Everywhere
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Speed: Consistent ~25 tok/s
&lt;/h3&gt;

&lt;p&gt;Across all tests, generation speed held steady at roughly 20-27 tok/s:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple Q&amp;amp;A&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.6s&lt;/td&gt;
&lt;td&gt;19.8 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Go code generation&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;td&gt;25.7s&lt;/td&gt;
&lt;td&gt;23.4 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chinese JSON extraction&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;18.5s&lt;/td&gt;
&lt;td&gt;27.1 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent classification&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;0.4s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;25.6 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;1.3s&lt;/td&gt;
&lt;td&gt;27.1 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Prompt processing was much faster: 120-850 tok/s depending on batch size.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discovery #1: It's a Thinking Model
&lt;/h3&gt;

&lt;p&gt;This was the biggest surprise. When I first ran the tests, responses appeared empty. After debugging the streaming output, I discovered Gemma 4 is a &lt;strong&gt;thinking model&lt;/strong&gt; — like DeepSeek-R1 or o1.&lt;/p&gt;

&lt;p&gt;For complex questions, the response looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"thinking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Here's a thinking process..."&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"thinking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;" to arrive at..."&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;many&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;thinking&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tokens&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"The three main patterns are..."&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model spends tokens on chain-of-thought reasoning in the &lt;code&gt;thinking&lt;/code&gt; field before producing the final answer in &lt;code&gt;content&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The critical parameter&lt;/strong&gt;: &lt;code&gt;"think": false&lt;/code&gt; disables this behavior:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;think=true&lt;/th&gt;
&lt;th&gt;think=false&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;6.9s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.7x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON extraction&lt;/td&gt;
&lt;td&gt;19.4s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.3s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.5x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;26.7s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13.3s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For structured extraction and classification, &lt;code&gt;think=false&lt;/code&gt; is essential. You get the same quality output without the reasoning overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discovery #2: Ollama API Quirks
&lt;/h3&gt;

&lt;p&gt;Two gotchas that cost me an hour of debugging:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;/api/generate&lt;/code&gt; is broken&lt;/strong&gt; for Gemma 4 — the &lt;code&gt;response&lt;/code&gt; field is always empty (tokens are generated but not decoded to text). You &lt;strong&gt;must&lt;/strong&gt; use &lt;code&gt;/api/chat&lt;/code&gt; instead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool calling needs &lt;code&gt;num_predict &amp;gt;= 2048&lt;/code&gt;&lt;/strong&gt; — with smaller budgets, thinking tokens consume the entire allocation and tool calls never emit. With enough headroom, the model is smart enough to skip thinking and call tools directly (34 tokens, 1.3s).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Discovery #3: Tool Calling is Excellent
&lt;/h3&gt;

&lt;p&gt;Given this tool definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_contracts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"min_budget"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"enum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"IT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"construction"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"services"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the prompt: &lt;em&gt;"Find IT contracts over 5M CNY"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Gemma 4 correctly inferred:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_contracts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"min_budget"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IT contracts"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;34 tokens, 1.3 seconds.&lt;/strong&gt; No thinking needed. This makes it viable for real-time tool routing.&lt;/p&gt;
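
&lt;p&gt;On the Go side, routing that result is straightforward. Here's a minimal sketch of how I map the returned JSON into a struct and dispatch it. The &lt;code&gt;dispatchToolCall&lt;/code&gt; helper and the &lt;code&gt;searchContracts&lt;/code&gt; callback are illustrative stand-ins, not the real Nexus code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package tools

import (
    "encoding/json"
    "fmt"
)

// toolCall mirrors the JSON Gemma 4 emitted above; arguments stay raw
// until we know which tool was requested.
type toolCall struct {
    Name      string          `json:"name"`
    Arguments json.RawMessage `json:"arguments"`
}

type searchContractsArgs struct {
    Query     string  `json:"query"`
    MinBudget float64 `json:"min_budget"`
    Category  string  `json:"category"`
}

// dispatchToolCall routes one parsed call to its handler. The searchContracts
// callback is a stand-in for the real query code.
func dispatchToolCall(tc toolCall, searchContracts func(searchContractsArgs) error) error {
    switch tc.Name {
    case "search_contracts":
        var args searchContractsArgs
        if err := json.Unmarshal(tc.Arguments, &amp;amp;args); err != nil {
            return fmt.Errorf("search_contracts: bad arguments: %w", err)
        }
        // here: {Query:"IT contracts", MinBudget:5000000, Category:"IT"}
        return searchContracts(args)
    default:
        return fmt.Errorf("unknown tool %q", tc.Name)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;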

&lt;h2&gt;
  
  
  The Architecture: Tiered Intelligence
&lt;/h2&gt;

&lt;p&gt;Based on the benchmarks, I designed a two-tier system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Request
    |
    v
+------------------+
|  Gemma 4 (local) |  &amp;lt;-- Fast classification, extraction, routing
|  think=false     |      Latency: &amp;lt;1-4s, Cost: $0
|  ~25 tok/s       |
+--------+---------+
         |
    +----+----+
    | Simple  | --&amp;gt; Return directly (classification, extraction, tags)
    | Complex | --&amp;gt; Escalate to cloud
    +----+----+
         v
+------------------+
| Claude/GPT (API) |  &amp;lt;-- Complex reasoning, long-form generation
| High quality     |      Latency: 2-10s, Pay per token
+------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;most "intelligence" tasks in a multi-module app are simple classification and extraction&lt;/strong&gt; — exactly what a local 8B model excels at.&lt;/p&gt;
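
&lt;p&gt;In code, the routing decision is tiny. This is a simplified sketch of that rule; the task labels and the token threshold are illustrative, and each module applies its own variant:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package router

// Tier picks which backend answers a request.
type Tier int

const (
    TierLocal Tier = iota // Gemma 4, think=false, ~25 tok/s, $0
    TierCloud             // Claude / GPT for complex reasoning
)

// classifyTier keeps cheap, structured work local and escalates the rest.
func classifyTier(task string, estOutputTokens int) Tier {
    switch {
    case estOutputTokens &amp;gt; 512: // long-form output is too slow at ~25 tok/s
        return TierCloud
    case task == "classification", task == "extraction", task == "routing", task == "tagging":
        return TierLocal
    default:
        return TierCloud
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;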

&lt;h2&gt;
  
  
  Four Integrations in One Afternoon
&lt;/h2&gt;

&lt;h3&gt;
  
  
  P1: Master RAG — Query Classification Middleware
&lt;/h3&gt;

&lt;p&gt;The RAG knowledge base has 80+ domains and 7 namespaces. Previously, users had to manually specify &lt;code&gt;domains: ["ai-ml"]&lt;/code&gt; in their searches.&lt;/p&gt;

&lt;p&gt;Now Gemma 4 auto-classifies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;DB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ClassifyQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;QueryClassification&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QuickClassify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classifyPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// Returns: {domains: ["ai-ml"], namespaces: ["code"], search_mode: "hybrid"}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: &amp;lt;1s to auto-detect domain/namespace. Users just type their query naturally.&lt;/p&gt;

&lt;h3&gt;
  
  
  P2: Forum — Message Preprocessing
&lt;/h3&gt;

&lt;p&gt;The multi-agent discussion forum runs 3+1 AI agents (Claude, Codex, Gemini + coordinator). Each message was going to the cloud for analysis.&lt;/p&gt;

&lt;p&gt;Now messages are preprocessed locally — &lt;strong&gt;in a goroutine so it doesn't block the discussion&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;handleSpeak&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agentID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;preprocessMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agentID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"forum:post:meta"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="c"&gt;// ... save post and advance turn (not blocked) ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Intent classification, sentiment analysis, and topic extraction — all in &amp;lt;1s, invisible to the discussion flow.&lt;/p&gt;

&lt;h3&gt;
  
  
  P3: Nexus — Tool Routing
&lt;/h3&gt;

&lt;p&gt;Nexus orchestrates multiple AI agent terminals. When creating a new agent session, the system now classifies the task intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "What design patterns are used in the codebase?"
Gemma4: module=code, confidence=0.87, hint=grep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is exposed as both an internal routing signal and a standalone MCP tool (&lt;code&gt;classify_intent&lt;/code&gt;).&lt;/p&gt;
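
&lt;p&gt;Both entry points share one function. Here's a sketch of that shared piece, assuming a small &lt;code&gt;classifier&lt;/code&gt; interface over the Ollama client; the struct fields mirror the &lt;code&gt;module&lt;/code&gt;, &lt;code&gt;confidence&lt;/code&gt;, and &lt;code&gt;hint&lt;/code&gt; values above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package nexus

import (
    "context"
    "encoding/json"
    "fmt"
)

// IntentResult is what both the session router and the classify_intent
// MCP tool hand back.
type IntentResult struct {
    Module     string  `json:"module"`     // e.g. "code"
    Confidence float64 `json:"confidence"` // e.g. 0.87
    Hint       string  `json:"hint"`       // e.g. "grep"
}

// classifier is satisfied by the lightweight Ollama client shown later.
type classifier interface {
    QuickClassify(ctx context.Context, system, input string) (string, error)
}

// ClassifyIntent is the shared entry point: the router calls it directly,
// and the classify_intent MCP tool is a thin wrapper around it.
func ClassifyIntent(ctx context.Context, c classifier, userRequest string) (*IntentResult, error) {
    const prompt = `Classify the request. Reply with JSON only: {"module":"","confidence":0,"hint":""}`
    raw, err := c.QuickClassify(ctx, prompt, userRequest)
    if err != nil {
        return nil, fmt.Errorf("classify_intent: %w", err)
    }
    var res IntentResult
    if err := json.Unmarshal([]byte(raw), &amp;amp;res); err != nil {
        return nil, fmt.Errorf("classify_intent: bad JSON: %w", err)
    }
    return &amp;amp;res, nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;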

&lt;h3&gt;
  
  
  Bonus: The Duck Secretary Gets a Brain
&lt;/h3&gt;

&lt;p&gt;MasterCLI's Dashboard has a mascot — a yellow rubber duck secretary that scans the project state and generates daily briefings. Before Gemma 4, it produced mechanical summaries like &lt;code&gt;"28 task(s) ready, 10 active goal(s)"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now it generates actual insights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before: "28 task(s) ready, 10 active goal(s)"

The Browser module currently has the largest backlog, with 11 pending tasks.
         B-13, B-14, and B-15 are ready to begin.
         Prioritizing this batch today would also help create a more stable foundation for Dashboard and Nexus."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key was &lt;strong&gt;prompt compression&lt;/strong&gt;: a long prompt (180 chars, 5 requirements) took 19.7s. A one-line prompt (50 chars) with compact data produced equally good output in &lt;strong&gt;4.3s&lt;/strong&gt;. The duck is now genuinely useful.&lt;/p&gt;
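
&lt;p&gt;To give a feel for the difference (the prompt wording here is reconstructed for the post; only the timings are from my tests):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package dashboard

import (
    "fmt"
    "strings"
)

// The long prompt (19.7s) spelled out every requirement; the compact one
// (4.3s) trusts the model. Wording is illustrative, timings are from above.
const verbosePrompt = `You are the Dashboard secretary. Review the project state,
find the module with the largest backlog, suggest which tasks to start today,
mention dependencies between modules, keep a friendly tone, stay under 80 words.`

const compactPrompt = `Summarize this project state in 2-3 sentences, suggest next tasks.`

// compactState gives the model one line per module instead of raw task dumps.
func compactState(pendingByModule map[string]int) string {
    var b strings.Builder
    for module, n := range pendingByModule {
        fmt.Fprintf(&amp;amp;b, "%s: %d pending\n", module, n)
    }
    return b.String()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;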

&lt;h2&gt;
  
  
  The Go Client: 150 Lines
&lt;/h2&gt;

&lt;p&gt;Each module gets a lightweight Ollama chat client — the same pattern, ~150 lines of Go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;OllamaChat&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;   &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="c"&gt;// "http://localhost:11434"&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;      &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="c"&gt;// "gemma4"&lt;/span&gt;
    &lt;span class="n"&gt;httpClient&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;OllamaChat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;QuickClassify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// POST /api/chat with stream=true, think=false, num_predict=128&lt;/span&gt;
    &lt;span class="c"&gt;// Concatenate streaming chunks, return content&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key configuration rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Always use &lt;code&gt;/api/chat&lt;/code&gt;&lt;/strong&gt;, never &lt;code&gt;/api/generate&lt;/code&gt; (Gemma 4 bug)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;think: false&lt;/code&gt;&lt;/strong&gt; for classification/extraction (7x faster)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;num_predict: 2048&lt;/code&gt;&lt;/strong&gt; for tool calling (needs headroom)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming mode&lt;/strong&gt; to capture both &lt;code&gt;thinking&lt;/code&gt; and &lt;code&gt;content&lt;/code&gt; fields&lt;/li&gt;
&lt;/ul&gt;
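
&lt;p&gt;Putting those rules together, here's a sketch of what the &lt;code&gt;QuickClassify&lt;/code&gt; body looks like, assuming the &lt;code&gt;OllamaChat&lt;/code&gt; struct above. It's trimmed for the post (no retries, minimal error handling), and the chunk struct only keeps the fields this helper needs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package ollamaclient

import (
    "bufio"
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "strings"
)

// QuickClassify posts to /api/chat with think=false and concatenates the
// streamed content chunks into one string.
func (o *OllamaChat) QuickClassify(ctx context.Context, system, input string) (string, error) {
    payload, _ := json.Marshal(map[string]any{
        "model":   o.model,
        "stream":  true,
        "think":   false, // classification never needs chain-of-thought
        "options": map[string]any{"num_predict": 128},
        "messages": []map[string]string{
            {"role": "system", "content": system},
            {"role": "user", "content": input},
        },
    })
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, o.endpoint+"/api/chat", bytes.NewReader(payload))
    if err != nil {
        return "", err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := o.httpClient.Do(req)
    if err != nil {
        return "", fmt.Errorf("ollama: %w", err)
    }
    defer resp.Body.Close()

    // The stream is JSON lines; "content" carries the answer, "thinking" is
    // ignored here because think=false.
    var sb strings.Builder
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        var chunk struct {
            Message struct {
                Content string `json:"content"`
            } `json:"message"`
            Done bool `json:"done"`
        }
        if err := json.Unmarshal(scanner.Bytes(), &amp;amp;chunk); err != nil {
            continue // skip malformed lines
        }
        sb.WriteString(chunk.Message.Content)
        if chunk.Done {
            break
        }
    }
    return strings.TrimSpace(sb.String()), scanner.Err()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;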

&lt;h2&gt;
  
  
  Cost Analysis
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before (Cloud API)&lt;/th&gt;
&lt;th&gt;After (Local Gemma 4)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAG classification&lt;/td&gt;
&lt;td&gt;~$7/day&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forum preprocessing&lt;/td&gt;
&lt;td&gt;~$8/day&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nexus routing&lt;/td&gt;
&lt;td&gt;~$1/day&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duck Secretary insight&lt;/td&gt;
&lt;td&gt;~$1/day&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$17/day&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0 + electricity&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Annual savings&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$6,200&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tradeoff: ~25 tok/s means you can't use it for long-form generation. But for classification, extraction, and routing? It's free and fast enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gemma 4 is a thinking model&lt;/strong&gt; — if you don't know this, your responses look empty. Use &lt;code&gt;think: false&lt;/code&gt; for production workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;8B models are production-ready for structured tasks&lt;/strong&gt; — classification, extraction, tool calling. Don't overpay for intelligence you don't need.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Ollama API has model-specific quirks&lt;/strong&gt; — always test with your specific model. Gemma 4 breaks the generate endpoint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid architecture wins&lt;/strong&gt; — local models for fast/cheap tasks, cloud for complex reasoning. The routing logic itself can run on the local model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Go + Ollama streaming is straightforward&lt;/strong&gt; — the &lt;code&gt;/api/chat&lt;/code&gt; streaming protocol is simple JSON lines. No SDK needed.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;p&gt;The hybrid architecture in this article — local models for routing, cloud models for reasoning — is one of the patterns I cover in depth in my two books:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://shadowshao.gumroad.com/l/production-mcp-servers-go" rel="noopener noreferrer"&gt;"Production MCP Servers with Go"&lt;/a&gt;&lt;/strong&gt; covers the full lifecycle of building MCP servers like the ones powering Master RAG: tool calling, resource management, authentication, testing, and deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://shadowshao.gumroad.com/l/building-ai-coding-agents" rel="noopener noreferrer"&gt;"Building AI Coding Agents"&lt;/a&gt;&lt;/strong&gt; goes wider — agent loops, context management, safety models, eval frameworks, and multi-agent orchestration. The model routing pattern from Chapter 6 is exactly what this article implements with Gemma 4.&lt;/p&gt;

&lt;p&gt;Both are based on the same production codebase described here.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you tested Gemma 4 locally? What's your experience with hybrid local/cloud architectures? I'd love to hear about your setup in the comments.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: #gemma4 #ollama #golang #ai #mcp #localllm #devtools&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Series&lt;/strong&gt;: Building AI-Native Applications with Go&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cover image description&lt;/strong&gt;: A laptop with terminal showing Ollama running Gemma 4, with performance metrics overlay showing ~25 tok/s generation speed.&lt;/p&gt;

</description>
      <category>gemma4</category>
      <category>ollama</category>
      <category>go</category>
      <category>ai</category>
    </item>
    <item>
      <title>I wrote the first book on building production MCP servers with Go</title>
      <dc:creator>Binary Ink</dc:creator>
      <pubDate>Thu, 02 Apr 2026 03:55:40 +0000</pubDate>
      <link>https://forem.com/david_shawn_e308bed98c45b/i-wrote-the-first-book-on-building-production-mcp-servers-with-go-14b9</link>
      <guid>https://forem.com/david_shawn_e308bed98c45b/i-wrote-the-first-book-on-building-production-mcp-servers-with-go-14b9</guid>
      <description>&lt;p&gt;Most MCP tutorials use Python. That's fine for prototypes. But when you need a server that handles thousands of concurrent connections on 128 MB of RAM, starts in 50ms, and deploys as a single binary — you need Go.&lt;/p&gt;

&lt;p&gt;I spent the last few months building MCP servers in Go for production systems. Eight different servers, 4,000+ lines of production code, handling real workloads across project management, browser automation, knowledge bases, and multi-agent orchestration.&lt;/p&gt;

&lt;p&gt;Then I realized: &lt;strong&gt;there is no book on this.&lt;/strong&gt; Not one. The MCP docs cover the protocol. There are Python quickstarts. TypeScript examples. But nothing that shows you how to build a production Go MCP server with authentication, database integration, deployment, and billing.&lt;/p&gt;

&lt;p&gt;So I wrote one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Go for MCP Servers?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;TypeScript&lt;/th&gt;
&lt;th&gt;Go&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~50-100 MB&lt;/td&gt;
&lt;td&gt;~30-60 MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~5-15 MB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Startup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1-3s&lt;/td&gt;
&lt;td&gt;0.5-1s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt;50ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concurrency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;asyncio&lt;/td&gt;
&lt;td&gt;event loop&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;goroutines&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;venv + pip&lt;/td&gt;
&lt;td&gt;node_modules&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;single binary&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cross-compile&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;painful&lt;/td&gt;
&lt;td&gt;painful&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;GOOS=linux go build&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The numbers matter when you run multiple MCP servers. A Go MCP server uses 10x less memory than Python, starts 50x faster, and deploys as a single file with zero dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned from production
&lt;/h2&gt;

&lt;p&gt;Here are patterns that aren't in any tutorial:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Serve SSE and Streamable HTTP on the same port
&lt;/h3&gt;

&lt;p&gt;Different AI clients use different transports. Claude uses SSE. Codex uses Streamable HTTP. Don't make users configure which one — serve both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;mux&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewServeMux&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/sse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sseHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c"&gt;// Claude, Gemini&lt;/span&gt;
&lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;streamHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c"&gt;// Codex, newer clients&lt;/span&gt;
&lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/health"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;healthCheck&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One port. Every client works.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Business errors vs. system errors
&lt;/h3&gt;

&lt;p&gt;This is the #1 mistake in MCP server code. Tool handlers return two things: a result and an error. They mean different things:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Business error — the AI sees this and can retry&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewToolResultError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"user not found"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;

&lt;span class="c"&gt;// System error — crashes the request (database down, etc.)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"connection lost"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Return &lt;code&gt;NewToolResultError&lt;/code&gt; for "that didn't work, try something else." Return Go &lt;code&gt;error&lt;/code&gt; for "something is fundamentally broken." The AI handles the first kind gracefully. The second kind may close the connection.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Bearer token auth with browser fallback
&lt;/h3&gt;

&lt;p&gt;The browser &lt;code&gt;EventSource&lt;/code&gt; API cannot set custom headers. Period. So when a browser-based MCP client connects via SSE, the token goes in the URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;authMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandlerFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c"&gt;// Browser fallback — EventSource can't set headers&lt;/span&gt;
            &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Bearer "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"token"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;"Bearer "&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;expectedToken&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Unauthorized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;401&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, the token appears in server logs. Use HTTPS, rotate tokens, and strip query params from access logs.&lt;/p&gt;
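
&lt;p&gt;That last point is easy to forget. Here's a minimal sketch of what stripping the token from access logs looks like in practice: log a sanitized copy of the URL, never the raw request line.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package mcpserver

import (
    "log"
    "net/http"
)

// loggingMiddleware logs requests with the token query parameter redacted.
// The handler chain still sees the original URL.
func loggingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        u := *r.URL // copy before rewriting the query
        q := u.Query()
        if q.Has("token") {
            q.Set("token", "REDACTED")
            u.RawQuery = q.Encode()
        }
        log.Printf("%s %s", r.Method, u.String())
        next.ServeHTTP(w, r)
    })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;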

&lt;h3&gt;
  
  
  4. Session cleanup via MCP hooks
&lt;/h3&gt;

&lt;p&gt;MCP clients hold long-lived connections. When they disconnect (laptop closes, network drops), you need to clean up. The mcp-go library fires lifecycle hooks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;hooks&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddOnUnregisterSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sessionID&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SessionID&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;agentID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sessionAgents&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoadAndDelete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sessionID&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;disconnectAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agentID&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you skip this, you leak memory. Every disconnected session stays in your maps forever.&lt;/p&gt;
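
&lt;p&gt;For reference, &lt;code&gt;sessionAgents&lt;/code&gt; here is a &lt;code&gt;sync.Map&lt;/code&gt;: store the mapping when the agent connects, and &lt;code&gt;LoadAndDelete&lt;/code&gt; it in the hook above. A minimal sketch of the connect side (the &lt;code&gt;bindAgent&lt;/code&gt; name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package mcpserver

import "sync"

// sessionAgents maps MCP session IDs to agent IDs so the unregister hook
// above can clean up exactly what this session registered.
var sessionAgents sync.Map

// bindAgent is called from the tool handler that registers an agent
// for the current session.
func bindAgent(sessionID, agentID string) {
    sessionAgents.Store(sessionID, agentID)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;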

&lt;h3&gt;
  
  
  5. Symlinks break your path validation
&lt;/h3&gt;

&lt;p&gt;Most file-handling tools do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// LOOKS safe but ISN'T&lt;/span&gt;
&lt;span class="n"&gt;abs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userPath&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An attacker creates a symlink &lt;code&gt;workspace/data → /etc&lt;/code&gt; and requests &lt;code&gt;data/shadow&lt;/code&gt;. The prefix check passes. The symlink resolves to &lt;code&gt;/etc/shadow&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Fix: call &lt;code&gt;filepath.EvalSymlinks&lt;/code&gt; before the prefix check.&lt;/p&gt;
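
&lt;p&gt;A sketch of the corrected check. The &lt;code&gt;resolveInRoot&lt;/code&gt; name is mine; note that &lt;code&gt;EvalSymlinks&lt;/code&gt; also rejects paths that don't exist yet:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package mcpserver

import (
    "fmt"
    "path/filepath"
    "strings"
)

// resolveInRoot resolves symlinks first, then does a separator-aware
// prefix check against the (also resolved) root.
func resolveInRoot(root, userPath string) (string, error) {
    realRoot, err := filepath.EvalSymlinks(root)
    if err != nil {
        return "", err
    }
    resolved, err := filepath.EvalSymlinks(filepath.Join(realRoot, userPath))
    if err != nil {
        return "", err // also fails for paths that don't exist yet
    }
    if resolved != realRoot &amp;amp;&amp;amp;
        !strings.HasPrefix(resolved, realRoot+string(filepath.Separator)) {
        return "", fmt.Errorf("path escapes workspace: %s", userPath)
    }
    return resolved, nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;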

&lt;h2&gt;
  
  
  What the book covers
&lt;/h2&gt;

&lt;p&gt;12 chapters, 110+ pages, every example from production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;MCP Protocol&lt;/strong&gt; — architecture, transports, JSON-RPC flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quick Start&lt;/strong&gt; — a running server in 5 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server Scaffold&lt;/strong&gt; — dual transport, health checks, graceful shutdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Development&lt;/strong&gt; — schemas, validation, rate limiting, long-running ops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources &amp;amp; Prompts&lt;/strong&gt; — fixed/template resources, context-bundling prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication &amp;amp; Security&lt;/strong&gt; — bearer tokens, symlink defense, risk classification, API keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Integration&lt;/strong&gt; — pgxpool, embedded migrations, pgvector semantic search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing&lt;/strong&gt; — unit, integration, testcontainers, CI/CD&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt; — multi-stage Docker, Compose, Caddy HTTPS, Prometheus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production Patterns&lt;/strong&gt; — sessions, events, multi-tenant, circuit breakers, "mistakes I made"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monetization&lt;/strong&gt; — Stripe billing, pricing models, distribution, case study ($10K MRR in 6 weeks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Appendix&lt;/strong&gt; — client compatibility matrix, quick reference, LLM uncertainty handling&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The MCP economy is wide open
&lt;/h2&gt;

&lt;p&gt;17,000+ MCP servers exist. Less than 5% are monetized. The SDK gets 97 million monthly downloads. This is the mobile app store in 2009 — massive developer activity, almost no established business models.&lt;/p&gt;

&lt;p&gt;The book's final chapter covers how to monetize: freemium, usage-based, hybrid pricing, Stripe metering integration, and distribution across MCP marketplaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get the book
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://shadowshao.gumroad.com/l/production-mcp-servers-go" rel="noopener noreferrer"&gt;Production MCP Servers with Go&lt;/a&gt;&lt;/strong&gt; — $39 on Gumroad.&lt;/p&gt;

&lt;p&gt;PDF + EPUB. 110 pages. 12 chapters. All code from production systems.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions? Drop them in the comments. I'll answer everything about building MCP servers with Go.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>mcp</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
