Forem: arunkant

Vellum — a private, on‑device screenshot assistant powered by Gemma 4

arunkant — Thu, 21 May 2026 09:02:03 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Vellum is a tray‑resident macOS app that turns any screenshot into a conversation. Hit ⌘⇧1 to drag a region, or ⌘⇧2 to grab the whole screen — the capture pops open in a chat window, the image is described and OCR'd in the background, and you can immediately ask follow‑up questions about what's on screen.

The interesting bit: Vellum runs Gemma 4 locally on Apple Silicon by default. No screenshot ever leaves the machine. There's no account, no API key, no cloud round‑trip — just a global hotkey and a vision‑language model living in your menu bar.

How I actually use it: over a few days I capture screenshots of things I read on the internet — articles, threads, diagrams, half‑finished thoughts — and let them pile up. Then one evening I sit down and review the whole stack. Because every capture is OCR'd by Gemma 4, the pile is fully searchable, and the review session turns into something more useful than a screenshot graveyard:

Capture now, think later. Grab anything interesting with ⌘⇧1 and keep reading — no context switch, no "where do I save this".
OCR everything automatically. Gemma 4 reads the text out of each image on capture, so I can search across days of captures by what was in them, not just when I took them.
Review in one sitting. An evening pass over the gallery surfaces patterns I'd never have spotted scrolling live.
New blog‑post ideas. Asking Gemma about a cluster of related captures regularly hands me an angle for something I want to write.
Organize with tags. Group captures into themes so the next review starts already sorted.

A few quality‑of‑life touches that make the review session flow:

One‑click "Copy for Slack / JIRA". From the capture's action bar, Vellum puts the image and a ready‑to‑paste blurb on the clipboard — Gemma 4's description plus the tags, formatted for each tool (Slack mrkdwn or Jira markup). Paste straight into a thread or ticket.
"Copy with shadow" for a polished share. One click composites the raw screenshot onto a transparent canvas with padding, rounded corners and a soft drop shadow, then drops it on the clipboard — the kind of framed shot you'd otherwise reach for a separate tool to make.
Plain "Copy image" for everything else.
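
The per-tool formatting can be pictured with a small sketch. This is not Vellum's actual code: `CaptureSummary`, `ShareTarget`, and `formatBlurb` are hypothetical names, and the layout of the blurb is an assumption. It relies only on the fact that Slack mrkdwn writes inline code with backticks while Jira wiki markup uses double braces, and that both render `*text*` as bold.

```typescript
// Hypothetical sketch: none of these names come from Vellum's source.
type ShareTarget = 'slack' | 'jira';

interface CaptureSummary {
  description: string; // model-generated description of the capture
  tags: string[];      // user-assigned tags
}

// Build a paste-ready text blurb for the chosen tool.
function formatBlurb(capture: CaptureSummary, target: ShareTarget): string {
  const description = capture.description.trim();
  if (capture.tags.length === 0) return description;

  // Slack mrkdwn uses backticks for inline code; Jira wiki markup uses {{...}}.
  const wrap = (tag: string): string =>
    target === 'slack' ? `\`${tag}\`` : `{{${tag}}}`;
  const tagLine = capture.tags.map(wrap).join(' ');

  // Both syntaxes render *text* as bold.
  return `${description}\n*Tags:* ${tagLine}`;
}
```

The image itself would go on the clipboard separately; this only covers the text half of the action.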

Each action also has a ⌘1…9 shortcut, so the whole review‑and‑share loop stays on the keyboard.

Source + install script (one curl line for Apple Silicon): https://github.com/arunkant/vellum

Demo

Install on macOS (Apple Silicon):

curl -fsSL https://www.arunkant.com/vellum/install.sh | bash

How I Used Gemma 4

Vellum bundles llama.cpp's llama-server as an extraResource and, on first use of the local provider, downloads:

gemma-4-E4B-it-Q4_K_M.gguf — Google's vision‑capable Gemma 4 E4B‑it, quantized to Q4_K_M (~5 GB).
mmproj-BF16.gguf — the multimodal projector that lets Gemma 4 actually look at pixels.

Both come from Unsloth's GGUF mirror of the official google/gemma-4-E4B-it. The server is launched lazily the first time the user takes a screenshot, then kept warm for the rest of the session.
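
A minimal sketch of the "launch lazily, then keep warm" idea, assuming the start-up work can be wrapped in a single async function. `lazyOnce` and `ensureServer` are hypothetical names, not taken from Vellum's source, and the port value is a placeholder.

```typescript
// Hypothetical helper: run an async start-up routine at most once, on first demand.
function lazyOnce<T>(start: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = start().catch((err) => {
        pending = undefined; // a failed start may be retried on the next call
        throw err;
      });
    }
    // Every caller shares the same promise, so the server starts once.
    return pending;
  };
}

// Assumed usage: the body would spawn llama-server and resolve once it is ready.
const ensureServer = lazyOnce(async () => {
  return { port: 8080 }; // placeholder value, not Vellum's real port
});
```

Every capture would then await `ensureServer()` before sending a request: the first call pays the start-up cost and later calls reuse the running process.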

The request flow is a regular OpenAI‑compatible vision call to localhost — Gemma 4 happily accepts an image part next to a text part:

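// src/ai/local.ts
// OpenAI-compatible vision request to the local llama-server.
const res = await net.fetch(`${serverEndpoint()}/v1/chat/completions`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'gemma-4-e4b',
    messages: [{
      role: 'user',
      // One text part and one image part in the same user turn.
      content: [
        { type: 'text', text: req.prompt },
        { type: 'image_url', image_url: { url: imageToDataUrl(req.imagePath) } },
      ],
    }],
    max_tokens: req.maxTokens ?? 2000,
    temperature: 0.2, // low temperature for literal answers
  }),
});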
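
The excerpt stops at the request. A sketch of reading the reply follows, assuming the server returns the usual OpenAI-style chat completion body with the text at `choices[0].message.content`. `ChatCompletionResponse` and `extractReply` are hypothetical names and cover only the fields used here.

```typescript
// Minimal view of an OpenAI-compatible chat completion body (only the fields read here).
interface ChatCompletionResponse {
  choices?: Array<{ message?: { content?: string | null } }>;
}

// Pull the assistant text out of the response, failing loudly if it is absent.
function extractReply(body: ChatCompletionResponse): string {
  const content = body.choices?.[0]?.message?.content;
  if (typeof content !== 'string' || content.trim() === '') {
    throw new Error('No text content in chat completion response');
  }
  return content.trim();
}
```

Failing loudly on an empty reply keeps a broken server response from being stored as if it were a real description.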

The launcher arguments are deliberately conservative so that the server fits comfortably alongside a browser and an IDE on a 16 GB MacBook:

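// src/ai/llama-server.ts
const args = [
  '-m', modelPath(),         // quantized Gemma 4 weights (GGUF)
  '--mmproj', mmprojPath(),  // multimodal projector, needed for image input
  '--host', '127.0.0.1',     // bind to loopback only
  '--port', String(port),
  '-c', '8192',              // context size in tokens
  '--no-webui',              // skip the built-in web UI
];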

Two prompts drive everything:

Auto‑analysis: a short instruction that asks Gemma 4 to describe the app or page and pull out the most important visible text. It runs the moment a capture lands.
Chat: every user message is forwarded with the original image still attached, so Gemma can ground each answer in the pixels rather than in its own prior summary.
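
One way the chat payload could be assembled is sketched below. The types and `buildChatMessages` are hypothetical, and attaching the image to the newest user turn only, rather than to every user turn in the history, is an assumption made to keep the payload small. The source does not say which turns carry the image.

```typescript
// Content parts in the OpenAI-compatible chat format.
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } };

interface ChatMessage {
  role: 'user' | 'assistant';
  content: string | ContentPart[];
}

// Hypothetical stored shape for earlier turns of the conversation.
interface Turn {
  role: 'user' | 'assistant';
  text: string;
}

// Earlier turns go out as plain text; the new question carries the screenshot again.
function buildChatMessages(history: Turn[], question: string, imageDataUrl: string): ChatMessage[] {
  const earlier: ChatMessage[] = history.map((turn) => ({ role: turn.role, content: turn.text }));
  const latest: ChatMessage = {
    role: 'user',
    content: [
      { type: 'text', text: question },
      { type: 'image_url', image_url: { url: imageDataUrl } },
    ],
  };
  return [...earlier, latest];
}
```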

The same VisionProvider interface is also implemented by an OpenRouter‑backed provider, so users who want a beefier hosted model can swap in one without touching the rest of the app. Gemma 4 is the default, and it's the only path that works fully offline.
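
A rough sketch of what such an interface could look like. Only the name `VisionProvider` and the request fields `prompt`, `imagePath`, and `maxTokens` appear in the post. The `name` and `offline` members, the `analyze` method, and `pickProvider` are assumptions for illustration.

```typescript
// Request fields mirror the ones used in the src/ai/local.ts excerpt.
interface VisionRequest {
  prompt: string;
  imagePath: string;
  maxTokens?: number;
}

// Assumed shape: the real interface in Vellum may differ.
interface VisionProvider {
  readonly name: string;
  readonly offline: boolean; // true if the provider works without network access
  analyze(req: VisionRequest): Promise<string>;
}

// Pick the preferred provider if it is usable, otherwise fall back to the first usable one.
function pickProvider(
  providers: VisionProvider[],
  preferred: string | undefined,
  online: boolean,
): VisionProvider {
  const usable = providers.filter((p) => online || p.offline);
  const chosen = usable.find((p) => p.name === preferred) ?? usable[0];
  if (!chosen) throw new Error('No usable vision provider');
  return chosen;
}
```

With the local provider listed first, an offline machine always lands on it, whatever the user's preference is.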

Multimodal Insights

A few things I learned shipping Gemma 4 to end‑users:

The E4B vision model is genuinely shippable. It comfortably reads UI text, error dialogs, and code editors, and an 8k context (-c 8192) is plenty for a single screen capture.
The multimodal projector is the easy thing to forget. Without --mmproj, the model will silently treat the image like noise. Vellum downloads it alongside the weights and refuses to start the server if either file is missing.
Q4_K_M is the right tradeoff for laptops. It keeps the weights at roughly 5 GB on disk, leaves headroom for a browser, and the quality cost on screenshot tasks is invisible to me in practice.
I run Gemma at temperature 0.2. Screenshot Q&A wants literal, repeatable answers rather than creative ones, so a low value is the safer default.
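
The "refuse to start if either file is missing" guard can be sketched as a pre-flight check. `findMissingFiles` and `assertModelFiles` are hypothetical names, and the existence check is passed in as a function so the sketch stays independent of any particular file API.

```typescript
// Return a readable label for every required file that is not present.
function findMissingFiles(
  required: Record<string, string>,
  exists: (path: string) => boolean,
): string[] {
  return Object.entries(required)
    .filter(([, path]) => !exists(path))
    .map(([label, path]) => `${label} (${path})`);
}

// Throw before launching the server if the weights or the projector are absent.
function assertModelFiles(
  required: Record<string, string>,
  exists: (path: string) => boolean,
): void {
  const missing = findMissingFiles(required, exists);
  if (missing.length > 0) {
    throw new Error(`Refusing to start llama-server, missing: ${missing.join(', ')}`);
  }
}
```

In an Electron main process the `exists` argument could simply be `fs.existsSync`, with `required` holding the weights path and the projector path.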

Why Gemma 4 for this app

I built Vellum around a habit: capturing things I read over days, then reviewing the pile in one sitting. That only works if every capture is OCR'd and searchable the moment I take it — and I'm not about to stream days of half‑formed reading habits and private dashboards to a third‑party endpoint just to make that happen. An open‑weights vision model I can bundle and ship inside an Electron app is the only way that problem actually gets solved: it runs the moment I hit ⌘⇧1, with no account, no API key, and no network hop. Gemma 4 E4B‑it is the first model in this size class where the OCR and the follow‑up answers are consistently good enough that I stopped reaching for the cloud fallback.

Tech stack

Model: google/gemma-4-E4B-it (Q4_K_M GGUF, BF16 mmproj) via Unsloth
Runtime: llama.cpp llama-server, bundled per‑arch as an Electron extraResource
App shell: Electron + Vite + TypeScript, packaged with Electron Forge
Storage: SQLite for screenshots + AI results + chat history; images on disk
Auto‑update: GitHub Releases + a tiny installer script served from GitHub Pages

Why I'm Building SaaS in 2026

arunkant — Wed, 29 Apr 2026 12:42:21 +0000

Every few months, someone declares SaaS dead. The argument has gotten louder in 2026: why pay for vertical tools when an LLM and a GitHub Action can do the same job in an afternoon?

I've heard this pitch enough times - and built enough of those afternoon GitHub Actions myself - to know where it breaks. The demo always works. The Tuesday morning two months later, when the upstream API changes its response shape and your "AI agent" silently ships nonsense to customers, is where the wheels come off.

I think SaaS in 2026 is healthier than the obituaries suggest, but the conversation around it is shifting in ways worth paying attention to.

SaaS Isn't Dying, But the Conversation Is Shifting

You've seen the takes. "SaaS is dead." "AI agents will replace every vertical tool." "Why buy software when you can just prompt an LLM to do it in-house?"

I don't buy it.

Those takes confuse the interface with the infrastructure. Yes, AI changes how we interact with software. But the need for shared, maintained, specialized tools doesn't vanish just because you can spin up an AutoGen crew in ten minutes.

If anything, the more AI floods the market, the more valuable it becomes to have a concrete workflow that actually works while you sleep.

The Agent Trap: Swiss Army Knives Made of Jelly

In-house AI agents are seductive because they're infinitely flexible. The same agent can summarize tickets, draft emails, assign blame for bugs, and probably order pizza if you give it a DoorDash API key.

But that flexibility comes at a hidden cost: reliability.

An agent is essentially a reasoning layer wrapped around hope. It guesses at priorities. It hallucinates whether a PR description is customer-facing or internal. It breaks when an upstream tool sneezes. And every time a new model drops, you're back to prompt-engineering a Friday afternoon away.

The real cost isn't the OpenAI bill. It's you, at 11 PM, debugging why your changelog included "fix typo in admin panel" in the email that just went to 10,000 users.

The Framework: Agent First, Workflow Second

This isn't an argument against agents. It's an argument for knowing when to use them.

The pattern I keep coming back to:

1. Use agents to figure out the workflow.
What do internal teams actually want - raw ticket lists grouped by squad, or plain-English summaries? Should security patches get a red border? Should refactors be auto-filtered? Agents are great for exploration. You can ask, iterate, and throw away.

2. Convert the agent into a concrete workflow.
Once you know the shape of the work, stop treating it like a conversation. Hard-code the rules. Build the validation layer. Lock the integration. Turn the fuzzy agent into reliable software that does the same correct thing every Tuesday at 9 AM.

3. Mix intelligence and consistency where each belongs.
Use AI for the parts that benefit from intelligence - reading unstructured tickets, understanding context, writing human prose. Use software for the parts that require consistency - formatting, routing, permissions, delivery.

The mistake most teams make is stopping at step one and shipping the agent.

Build vs. Buy: The Hidden Maintenance Tax

There's another argument I keep hearing: "We'll just build our own AI tools internally. It's cheaper, and our data stays in-house."

Sure. But agents are not set-and-forget.

You still need people to maintain them. New API version from JIRA? Fix the parser. New model release? Retest all your prompts. Linear changed their webhook format again? Back to the logs. You've built a fragile ETL pipeline dressed up as a chatbot, and now you're the on-call engineer for a paragraph generator.

There's also a quieter cost most teams don't price in: external tools learn faster.

If every company builds its own changelog bot, you've got 500 engineering teams each solving the same edge cases in isolation. One team figures out how to handle JIRA sub-tasks. Another figures out GitHub co-author attribution. Nobody shares notes. A shared tool learns from all of them. Your internal agent only learns from you. Over time, that compounds.

What This Looks Like in Practice

Imagine the work of generating release notes done right.

You connect to your JIRA board (or GitHub, or Linear). You define your audiences - maybe an internal Slack channel for the engineering team and a public page for users. From your earlier exploration, you already know the rules:

Internal changelogs need ticket IDs and squad assignments.
External changelogs need human-readable context, no jargon.
Security patches always get highlighted.
"WIP" and "refactor" tickets get filtered out unless manually overridden.

Those rules become a workflow. Every release, it reads your tickets, applies the logic, generates the prose, and delivers it to the right place. It doesn't get creative. It doesn't forget the rules because it's Tuesday. It just works.
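
The four rules above can be written as plain, deterministic code. This is a sketch under assumptions, not Releasedog's implementation: the `Ticket` shape, the `forceInclude` override flag, and the idea that `summary` holds AI-written prose produced in an earlier step are all hypothetical.

```typescript
type Audience = 'internal' | 'external';

// Hypothetical ticket shape; real JIRA, GitHub, or Linear payloads look different.
interface Ticket {
  id: string;
  title: string;
  squad: string;
  labels: string[];
  summary: string;        // human-readable prose, assumed to be generated earlier
  forceInclude?: boolean; // manual override for filtered labels
}

interface ChangelogEntry {
  text: string;
  highlighted: boolean;
}

const FILTERED_LABELS = ['wip', 'refactor'];

function hasLabel(ticket: Ticket, label: string): boolean {
  return ticket.labels.some((l) => l.toLowerCase() === label);
}

// Same input, same output: no model call happens inside this function.
function buildChangelog(tickets: Ticket[], audience: Audience): ChangelogEntry[] {
  return tickets
    .filter((t) => t.forceInclude === true || !FILTERED_LABELS.some((l) => hasLabel(t, l)))
    .map((t) => ({
      // Internal readers get ticket IDs and squads; external readers get plain prose.
      text: audience === 'internal' ? `[${t.id}] (${t.squad}) ${t.title}` : t.summary,
      highlighted: hasLabel(t, 'security'),
    }));
}
```

The model's job ends at writing `summary`. Filtering, highlighting, and audience formatting stay in ordinary code that behaves the same on every release.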

You get the intelligence of AI - it understands your messy tickets - wrapped in the reliability of actual software. That's the shape of the SaaS I want to use, and it's the shape of the SaaS I'm building with Releasedog.

SaaS in 2026 Is the Plumbing, Not the Hype

I'm not skeptical of AI. I'm skeptical of where people are pointing it.

Agents are incredible demos. They're terrible infrastructure.

The future belongs to tools that take AI's flexibility and trap it inside software's reliability. Use agents to explore. Use workflows to run your business. And for the love of your future self, stop maintaining that Friday-night GitHub Action that generates your changelog.

Your 11 PM self will thank you.