<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Florian Zielasko</title>
    <description>The latest articles on Forem by Florian Zielasko (@flo1632).</description>
    <link>https://forem.com/flo1632</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3902652%2F12514101-804b-416f-8e56-cce5537dda7b.jpeg</url>
      <title>Forem: Florian Zielasko</title>
      <link>https://forem.com/flo1632</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/flo1632"/>
    <language>en</language>
    <item>
      <title>Local AI Assistant powered by Gemma 4</title>
      <dc:creator>Florian Zielasko</dc:creator>
      <pubDate>Fri, 08 May 2026 08:27:11 +0000</pubDate>
      <link>https://forem.com/flo1632/local-ai-assistant-powered-by-gemma-4-132b</link>
      <guid>https://forem.com/flo1632/local-ai-assistant-powered-by-gemma-4-132b</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Reiseki (霊石) is a local AI assistant. It uses Gemma 4 via Ollama to handle real tasks: reading and writing files, generating Word and PDF documents, lightweight data analysis, managing reminders, and remembering context across sessions.&lt;/p&gt;

&lt;p&gt;The goal was to build something usable by people who have never touched a terminal. Reiseki ships as a Windows installer — after installing Ollama and a model, you pick a workspace folder, and the agent is ready. No Python environment, no config files, no command line. &lt;/p&gt;

&lt;p&gt;Under the hood it runs a ReAct loop (Reason → Act → Observe) with 15+ tools, persistent conversation history in SQLite, a live tool trace in the UI, and a LAN access toggle for smartphone use via QR code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code &amp;amp; Demo
&lt;/h2&gt;

&lt;p&gt;The full source code and a video are available on GitHub:&lt;br&gt;
&lt;a href="https://github.com/Flo1632/reiseki" rel="noopener noreferrer"&gt;github.com/Flo1632/reiseki&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Windows installer (&lt;code&gt;ReisekiSetup.exe&lt;/code&gt;) is available on the &lt;a href="https://github.com/Flo1632/reiseki/releases" rel="noopener noreferrer"&gt;Releases&lt;/a&gt; page.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;Reiseki uses &lt;strong&gt;Gemma 4:e2b&lt;/strong&gt; (the 2B edge model) as its default model via Ollama.&lt;/p&gt;

&lt;p&gt;The choice was deliberate: Reiseki is built specifically for laptops and low-RAM devices. Most capable local models require significantly more resources or a dedicated GPU. Gemma 4:e2b is the first model I tested that handles multi-step tool calling reliably at this hardware tier — it follows the ReAct loop, uses tools correctly, and produces coherent responses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqxscgnjobnwvxxp5phw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqxscgnjobnwvxxp5phw.png" alt="Tools calls of Reiseki powered by Gemma 4 via Ollama" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The combination of small footprint and reliable tool use made Gemma 4:e2b the right fit for an offline-first personal agent.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building a Local AI Agent (Part 2): Six UX and UI Design Challenges</title>
      <dc:creator>Florian Zielasko</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:29:39 +0000</pubDate>
      <link>https://forem.com/flo1632/building-a-local-ai-agent-part-2-six-ux-and-ui-design-challenges-50en</link>
      <guid>https://forem.com/flo1632/building-a-local-ai-agent-part-2-six-ux-and-ui-design-challenges-50en</guid>
      <description>&lt;p&gt;In Part 1 I covered the six technical problems behind Reiseki's ReAct loop — iteration caps, context management, persistent memory, and security. If you haven't read it, start &lt;a href="https://dev.to/flo1632/building-a-local-ai-agent-part-1-six-technical-challenges-424b"&gt;there&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This part is about the design decisions. Even though I had an initial concept in mind, the current layout evolved over time during testing. I kept asking the same questions: Who is this for? What should the user be able to adjust, and what should be automated? And for an agent that has access to your files and remembers your conversations, how do I make it as transparent as possible?&lt;/p&gt;

&lt;p&gt;Reiseki is open source: &lt;a href="https://github.com/Flo1632/reiseki" rel="noopener noreferrer"&gt;github.com/Flo1632/reiseki&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Goals That Pulled in Different Directions
&lt;/h2&gt;

&lt;p&gt;Reiseki was designed with two principles in mind that don't always sit comfortably together.&lt;/p&gt;

&lt;p&gt;The first: &lt;strong&gt;it should be usable without any technical knowledge.&lt;/strong&gt; No terminal, no Python, no config files. If you can install an app, you can run Reiseki.&lt;/p&gt;

&lt;p&gt;The second: &lt;strong&gt;the user should always know what the agent knows about them.&lt;/strong&gt; Memories, conversation history, file access — all of it as visible as possible and deletable at any time.&lt;/p&gt;

&lt;p&gt;The tension is that transparency often means complexity: the challenge is surfacing the right information without overwhelming the user.&lt;/p&gt;

&lt;p&gt;Here's how I approached it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Six UX/UI Challenges and How I Solved Them
&lt;/h2&gt;

&lt;h3&gt;
  
  
  No setup required — just download and start
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;The challenge:&lt;/em&gt; If you can install an app, you should be able to run Reiseki. Everything else — model configuration, database setup, and so on — should happen invisibly in the background.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The solution:&lt;/em&gt; Reiseki ships as a Windows installer. You pick a workspace folder during installation, and that's it. No Python environment to configure, no config files to edit, no command line. Provided Ollama is installed and a model has been downloaded, the agent opens with a setup screen asking for a name and a short description of your goal, and then it's ready.&lt;/p&gt;

&lt;p&gt;The workspace setup during installation and the in-app model drop-down both came late, in versions 0.1.3 and 0.1.4, while I was testing.&lt;/p&gt;




&lt;h3&gt;
  
  
  Live tool trace
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;The challenge:&lt;/em&gt; How do you show the user what the model is doing, in real time, to create transparency?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The solution:&lt;/em&gt; Every time the agent calls a tool, the UI shows it in real time — which tool was called, with which arguments, and what came back. &lt;/p&gt;




&lt;h3&gt;
  
  
  Memory management panel
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;The challenge:&lt;/em&gt; Memories should be directly visible in the UI for full transparency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The solution:&lt;/em&gt; In early versions I tried to automate memory saving, but smaller models tend to forget the automation even when it is included in the system prompt. So I gave the user the call with a dedicated "Save Memory" button, which triggers the &lt;code&gt;save_memory&lt;/code&gt; tool call. The advantage: the user decides when a conversation is worth preserving, not the agent.&lt;/p&gt;

&lt;p&gt;There is also a memory panel at the top, which lets you see every stored memory and delete individual entries with one click. No asking the agent to forget something and hoping it complies — you delete it directly. The agent's knowledge about you is a list you can edit.&lt;/p&gt;




&lt;h3&gt;
  
  
  Conversation history modal
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;The challenge:&lt;/em&gt; The chat log should be both transparent and editable.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The solution:&lt;/em&gt; The chat log persists across sessions in SQLite, which means the agent has a record of past conversations. The history modal makes that visible and gives you a delete button. The entire log can be cleared at any time.&lt;/p&gt;




&lt;h3&gt;
  
  
  Smartphone access via QR toggle
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;The challenge:&lt;/em&gt; Making smartphone use possible without any app and as easy as possible.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The solution:&lt;/em&gt; The agent runs as a local web server, which means it's technically reachable from other devices on the same network — but only if you explicitly enable it. The QR modal has a toggle for this. When it's on, a QR code appears that you can scan from your phone. When it's off, the server blocks all non-localhost requests at the middleware level.&lt;/p&gt;

&lt;p&gt;The key design decisions here were giving the user explicit control over this functionality and using a QR code for easy access.&lt;/p&gt;
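&lt;p&gt;The gate itself reduces to a tiny check. This framework-agnostic sketch assumes a &lt;code&gt;lan_enabled&lt;/code&gt; flag mirroring the UI toggle; in the real app the check lives in the server's HTTP middleware:&lt;/p&gt;

```python
# Illustrative sketch: localhost always passes, LAN clients only when the
# QR toggle is on. Names here are assumptions, not Reiseki's actual code.
LOCAL_HOSTS = {"127.0.0.1", "::1", "localhost"}

def request_allowed(client_host, lan_enabled):
    """Return True if a request from client_host should be served."""
    if client_host in LOCAL_HOSTS:
        return True  # the desktop UI itself is never blocked
    return lan_enabled  # everyone else depends on the toggle
```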




&lt;h3&gt;
  
  
  Model switcher
&lt;/h3&gt;

&lt;p&gt;Ollama supports multiple models, and switching between them is a common workflow — sometimes you want a faster model for quick questions and a more capable one for complex tasks. Requiring a server restart to change the model creates unnecessary friction.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The challenge:&lt;/em&gt; Making the model switch as easy as possible.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The solution:&lt;/em&gt; The model switcher lets you change the model directly in the UI, and the change takes effect on the next message. No switching between Ollama and Reiseki is necessary, as long as you have already downloaded the models in Ollama.&lt;/p&gt;




&lt;h2&gt;
  
  
  Additional Thoughts
&lt;/h2&gt;

&lt;p&gt;The tension between "simple for everyday users" and "using as little RAM as possible" while "keeping it as functional as possible" never fully resolves — it just gets managed. A few things I'd approach differently next time:&lt;/p&gt;

&lt;p&gt;The database in the background is a simple, solid basis, but I very much like Claude's Markdown-file approach and think it offers even more flexibility. (I used Claude.md and other Markdown files during the Claude Code sessions, and it worked great!)&lt;/p&gt;

&lt;p&gt;First-run setup and user guidance: I still think a tutorial, such as an automated session or an introductory video, could help users discover the different features more effectively.&lt;/p&gt;

&lt;p&gt;The small model trade-off: During development I shifted from Qwen 2.5-coder:7B to Gemma 4:e2b, which made the agent and its tool calls significantly better. I hope that we see even more advanced small models in the future. On the other hand, more context and larger models would also make some of the coding challenges described in part 1 obsolete and would in general provide an even better experience. We currently cannot have it all.&lt;/p&gt;




&lt;p&gt;This project was built entirely with Claude Code. The technical decisions and design goals are mine; Claude handled the implementation.&lt;/p&gt;

&lt;p&gt;What UX/UI problems have you run into when building tools for non-technical users? I'm very curious about your experience.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ux</category>
      <category>beginners</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Building a Local AI Agent (Part 1): Six Technical Challenges</title>
      <dc:creator>Florian Zielasko</dc:creator>
      <pubDate>Wed, 29 Apr 2026 18:57:14 +0000</pubDate>
      <link>https://forem.com/flo1632/building-a-local-ai-agent-part-1-six-technical-challenges-424b</link>
      <guid>https://forem.com/flo1632/building-a-local-ai-agent-part-1-six-technical-challenges-424b</guid>
<description>&lt;p&gt;I've been building Reiseki (霊石) — a fully local AI agent that runs on your machine via Ollama, even if you have no more than 8-10 GB of RAM. The agent uses a ReAct loop (Reason → Act → Observe) to handle file operations, document generation, reminders, and more.&lt;/p&gt;

&lt;p&gt;Reiseki is open source: &lt;a href="https://github.com/Flo1632/reiseki" rel="noopener noreferrer"&gt;github.com/Flo1632/reiseki&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Along the way I ran into six technical problems that aren't obvious up front. Here's what I encountered and learned.&lt;/p&gt;

&lt;p&gt;In Part 2 I'll cover the UX and design challenges — my goal was to make a local AI agent feel understandable and usable for someone who has never touched a terminal.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Six Problems and How I Solved Them (Other Suggestions Welcome)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The agent forgot everything on restart
&lt;/h3&gt;

&lt;p&gt;The ReAct loop uses a Python list for session history — it lives in memory and disappears when the server restarts. If you are used to ChatGPT or Claude, this is very confusing: you expect the agent to remember what you discussed with it.&lt;/p&gt;

&lt;p&gt;The fix was a &lt;code&gt;chat_log&lt;/code&gt; table in SQLite. Every user and assistant message is written there as it happens. At the start of each new request, the last 10 turns are fetched from the database and prepended to the message history so the model has continuity across sessions. It does not remember everything, but at least the last conversation.&lt;/p&gt;
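&lt;p&gt;A minimal sketch of what such a persistence layer can look like (the schema and function names are illustrative, not Reiseki's actual code):&lt;/p&gt;

```python
# Persist every message as it happens; fetch the tail on the next request.
import sqlite3

def init_db(path="chat_log.db"):
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS chat_log ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "role TEXT NOT NULL, "
        "content TEXT NOT NULL)"
    )
    return con

def log_message(con, role, content):
    con.execute("INSERT INTO chat_log (role, content) VALUES (?, ?)", (role, content))
    con.commit()

def recent_turns(con, limit=10):
    # Newest rows first, then reversed back into chronological order
    rows = con.execute(
        "SELECT role, content FROM chat_log ORDER BY id DESC LIMIT ?", (limit,)
    ).fetchall()
    return [{"role": r, "content": c} for r, c in reversed(rows)]
```

&lt;p&gt;At request time, the result of &lt;code&gt;recent_turns()&lt;/code&gt; is prepended to the outgoing message list, so the model sees the tail of the previous conversation.&lt;/p&gt;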




&lt;h3&gt;
  
  
  2. The agent loop needs a hard iteration cap - especially for small models like Qwen 2.5-coder 7b
&lt;/h3&gt;

&lt;p&gt;Without a limit, a confused model or a buggy tool result can cause the agent to loop forever — calling the same tool repeatedly, getting the same error, never stopping. On a local device with limited RAM, that's a serious problem.&lt;/p&gt;

&lt;p&gt;The fix is a hard cap of 10 iterations. In practice, most tasks finish in 1-3 iterations. The cap is a safety net.&lt;/p&gt;
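&lt;p&gt;The shape of such a capped loop, sketched with hypothetical &lt;code&gt;call_model&lt;/code&gt; and &lt;code&gt;run_tool&lt;/code&gt; callables standing in for the model call and the tool dispatcher:&lt;/p&gt;

```python
# The cap guarantees termination even if the model keeps requesting tools.
MAX_ITERATIONS = 10

def react_loop(messages, call_model, run_tool):
    for _ in range(MAX_ITERATIONS):
        reply = call_model(messages)           # Reason
        if reply.get("tool_call") is None:     # plain answer: we're done
            return reply["content"]
        result = run_tool(reply["tool_call"])  # Act
        messages.append({"role": "assistant", "content": str(reply["tool_call"])})
        messages.append({"role": "tool", "content": result})  # Observe
    return "Stopped after reaching the iteration limit."
```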




&lt;h3&gt;
  
  
  3. Sending all tools on every request wastes context
&lt;/h3&gt;

&lt;p&gt;With 15+ tools defined, sending the full list on every request fills a meaningful chunk of the context window — and confuses the model. When it sees &lt;code&gt;create_chart&lt;/code&gt; and &lt;code&gt;analyse_data&lt;/code&gt; alongside a simple question like "what time is it?", it sometimes reaches for tools it doesn't need.&lt;/p&gt;

&lt;p&gt;The fix was dynamic tool selection based on relevance scoring. Core tools (read file, write file, list directory, document generators) are always included. Specialized tools (e.g. charts, data analysis) are only added when the query text is relevant to them — scored via string similarity between the query and each tool's description and keywords.&lt;/p&gt;

&lt;p&gt;In practice, this makes the model more focused and reduces unnecessary tool calls.&lt;/p&gt;

&lt;p&gt;After testing, I found that with a model that has a larger context window, it can be better to always include at least the tools you use regularly.&lt;/p&gt;
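&lt;p&gt;A hedged sketch of the idea using stdlib string similarity; the tool descriptions, keyword rule, and threshold here are invented for illustration:&lt;/p&gt;

```python
# Core tools always ship; specialized tools only when the query looks relevant.
from difflib import SequenceMatcher

CORE_TOOLS = ["read_file", "write_file", "list_directory"]

SPECIALIZED_TOOLS = {
    "create_chart": "create a chart or plot from data",
    "analyse_data": "analyse csv data and compute statistics",
}

def score(query, description):
    # Crude similarity between the user's query and a tool description
    return SequenceMatcher(None, query.lower(), description).ratio()

def select_tools(query, threshold=0.5):
    selected = list(CORE_TOOLS)
    for name, description in SPECIALIZED_TOOLS.items():
        keyword = name.split("_")[0]  # e.g. "create", "analyse"
        if score(query, description) > threshold or keyword in query.lower():
            selected.append(name)
    return selected
```

&lt;p&gt;With this shape, "what time is it?" ships only the core tools, while a charting request also pulls in the chart tool.&lt;/p&gt;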




&lt;h3&gt;
  
  
  4. The context window grows with every tool call
&lt;/h3&gt;

&lt;p&gt;In a ReAct loop, the message history grows by at least two entries per tool call — one for the assistant's decision, one for the tool result. A task like "read these five files and summarize them" can easily hit the model's context limit before it finishes.&lt;/p&gt;

&lt;p&gt;This was handled with three layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context compression&lt;/strong&gt; — after every four tool calls in a single turn, the middle portion of the message history is summarized by the model itself, then replaced with that summary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-turn cap&lt;/strong&gt; — the in-memory session history is capped at 20 messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent log cap&lt;/strong&gt; — the SQLite chat log is capped at 2000 rows with a rolling delete&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The compression approach: the model summarizes its own previous steps in 2-3 sentences, which gets injected back as a single message. It loses detail but keeps the agent on track without blowing the context limit. If the summarization call itself fails, it falls back to a hard truncation of the last few messages.&lt;/p&gt;
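&lt;p&gt;The compression layer can be sketched roughly like this, with a &lt;code&gt;summarize&lt;/code&gt; callable standing in for the model call; the slicing policy shown is an assumption, not the exact one Reiseki uses:&lt;/p&gt;

```python
# Keep the head (system-ish context) and tail (recent steps); summarize the middle.
def compress_history(messages, summarize, keep_head=1, keep_tail=4):
    if len(messages) - keep_head - keep_tail > 2:  # only when the middle is sizable
        middle = messages[keep_head:-keep_tail]
        try:
            summary = summarize(middle)  # model summarizes its own prior steps
            stub = [{"role": "assistant",
                     "content": f"Summary of earlier steps: {summary}"}]
        except Exception:
            stub = []  # fallback: hard truncation, keep only head and tail
        messages = messages[:keep_head] + stub + messages[-keep_tail:]
    return messages
```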




&lt;h3&gt;
  
  
  5. Local models don't always return structured tool calls
&lt;/h3&gt;

&lt;p&gt;The Ollama SDK has a proper structured field for tool calls — but not every model actually uses it. Gemma and Qwen sometimes serialize tool calls as plain JSON text in the response content instead. If you only handle the structured case, the agent silently ignores half its tool calls and you just receive a message in JSON format claiming it called a tool.&lt;/p&gt;

&lt;p&gt;The fix was a layered fallback parser: try structured first, then parse the content as JSON, then scan for embedded JSON objects anywhere in the text, then try newline-by-newline. It's more code than it should be, but it makes the agent reliable across different models.&lt;/p&gt;
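&lt;p&gt;A simplified version of the first three layers. The field names follow common Ollama-style message shapes but should be treated as assumptions, and the regex only catches flat JSON objects, for brevity:&lt;/p&gt;

```python
# Try the structured field, then whole-content JSON, then embedded JSON objects.
import json
import re

def parse_tool_calls(message):
    # Layer 1: structured field, if the model actually used it
    calls = message.get("tool_calls")
    if calls:
        return calls
    content = message.get("content", "")
    # Layer 2: the whole content is one JSON object
    try:
        obj = json.loads(content)
        if isinstance(obj, dict) and "name" in obj:
            return [obj]
    except json.JSONDecodeError:
        pass
    # Layer 3: scan for embedded flat JSON objects anywhere in the text
    found = []
    for match in re.finditer(r"\{[^{}]*\}", content):
        try:
            obj = json.loads(match.group())
            if isinstance(obj, dict) and "name" in obj:
                found.append(obj)
        except json.JSONDecodeError:
            continue
    return found
```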




&lt;h3&gt;
  
  
  6. Injecting past turns into the system prompt is a security risk
&lt;/h3&gt;

&lt;p&gt;My first approach was to paste previous messages directly into the system prompt as a block of text. But a security audit flagged this.&lt;/p&gt;

&lt;p&gt;The system prompt has operator-level trust — the model treats it like instructions from a developer. Injecting user messages there effectively promotes them to the same level. A past message like "ignore previous instructions" would now carry the same authority as your actual configuration. And because the history is baked into the prompt text, clearing the session doesn't actually reset it.&lt;/p&gt;

&lt;p&gt;The fix is to inject past turns as regular &lt;code&gt;user&lt;/code&gt;/&lt;code&gt;assistant&lt;/code&gt; entries in the message array, not as text in the system prompt. The model treats them with user-level trust, they stay isolated from the system context, and clearing the log actually resets them. It's a small structural change but an important one.&lt;/p&gt;
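&lt;p&gt;Structurally, the safer version looks something like this (the prompt text and history format are placeholders):&lt;/p&gt;

```python
# Only operator-written instructions occupy the system slot; history enters
# as ordinary turns and therefore keeps user-level trust.
SYSTEM_PROMPT = "You are Reiseki, a local assistant."  # operator-level trust

def build_messages(history, user_input):
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for role, content in history:  # past turns stay out of the system prompt
        messages.append({"role": role, "content": content})
    messages.append({"role": "user", "content": user_input})
    return messages
```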




&lt;p&gt;This project was built entirely with Claude Code. The technical decisions and design goals are mine; Claude handled the implementation.&lt;/p&gt;

&lt;p&gt;What technical problems have you run into building local AI agents? Curious whether others have found better approaches.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Part 2 — UX and Design Challenges: coming soon.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
