The Night Gemma 4 Changed How I Write Code

Rakesh Mondal — Sat, 09 May 2026 11:45:09 +0000

The night I stopped trusting the cloud

It was past midnight. I had a bug I could not explain, an API bill I
could not justify, and a growing discomfort I could not name.

Every prompt I sent to a cloud AI contained something real — a piece
of my actual project, my actual logic, my actual mistakes. And every
time I hit send, those thoughts traveled somewhere I did not control.

That night I pulled Gemma 4 locally for the first time.

I pointed it at a 200-line Python module and typed:

"What is wrong with how I have structured this?"

It did not compliment me. It called my error handling optimistic to
the point of being dangerous.

I checked. It was right. And nothing I typed ever left my machine.

That is when I understood what Gemma 4 actually is — not just a
smaller version of a cloud model, but a fundamentally different
relationship between a developer and their AI.

What Gemma 4 actually is (the version nobody explains clearly)

Gemma 4 is Google's open-weight model family released in 2025. The
weights are yours. You download them, run them, fine-tune them, and
ship them inside your own products without a single token touching a
third-party server.

But this generation is different from previous open models in three
important ways:

Native multimodal input. Gemma 4 models can process text and
images together in the same prompt — out of the box, no plugin
required. Feed it a screenshot of a UI bug and ask what is wrong.
Hand it a diagram and ask it to generate the corresponding code.
This changes what "local AI" means for real-world developer tasks.

128K context window. This is not a marketing number. A 128K
context window means you can feed Gemma 4 your entire codebase — not
just one file, not just one function. You can ask questions that span
modules, trace logic across hundreds of lines, and get answers that
understand the whole picture.

A range that runs everywhere. From a Raspberry Pi to a large-scale
server deployment, there is a Gemma 4 model for the hardware you
actually have.

The three variants — and which one is yours

This is the question every article avoids answering directly. I will
not do that.

Variant	Parameters	Runs on	Best for
`gemma-4-it-2b`	2 billion	Raspberry Pi, phones, edge devices	Embedded apps, offline tools, ultra-low latency
`gemma-4-it-9b`	9 billion	Laptop with 16GB RAM, mid-range GPU	Most developer tasks — start here
`gemma-4-it-27b`	27 billion	Workstation GPU (RTX 4090, A100)	Complex reasoning, long context tasks, production use

The it suffix means instruction-tuned — already aligned for
conversation and instruction following. The pt suffix means
pre-trained base, used for fine-tuning on your own domain.

My honest recommendation: run 9B first. It is fast enough for
real-time use, smart enough to reason properly, and fits on hardware
most developers already own. If it surprises you, you are done. If it
disappoints you for your specific use case, scale up to 27B.

Setting up locally — what actually works

I am not giving you a Colab notebook. I am telling you what I ran on
my own machine.

Requirements: Python 3.10+, 16GB RAM minimum, ~20GB disk space

# Install Ollama — the cleanest local inference runtime available
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 4 9B (adjust for your variant)
ollama pull gemma4:9b

# Run it immediately
ollama run gemma4:9b

Three commands. No API key. No billing alert at the end of the month.

To call it from Python:

import ollama

response = ollama.chat(
    model='gemma4:9b',
    messages=[
        {
            'role': 'user',
            'content': 'Review this function and be honest about what is wrong with it.'
        }
    ]
)

print(response['message']['content'])

For multimodal input — passing an image alongside text:

import ollama
import base64

with open('screenshot.png', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode()

response = ollama.chat(
    model='gemma4:9b',
    messages=[
        {
            'role': 'user',
            'content': 'This is a screenshot of a UI bug. What is causing it and how do I fix it?',
            'images': [image_data]
        }
    ]
)

print(response['message']['content'])

That is real multimodal inference running entirely on your machine.
No cloud. No cost per token. No data leaving your environment.

What I actually tested — not benchmarks, real tasks

Benchmarks tell you how a model scores on exams. Here is how it
performs on the things developers actually do.

Codebase-level reasoning with 128K context

I fed Gemma 4 27B an entire Node.js project — 43 files, roughly
18,000 lines — and asked:

"Which module has the most hidden coupling to other modules and
why is that a problem?"

It identified a utility file that was quietly imported by 11 other
modules, explained why this created a hidden dependency graph that
would make future refactoring painful, and suggested a specific
restructuring approach.

No cloud model I prompted with a single file ever gave me that answer,
because no cloud prompt ever had the full picture.

Multimodal bug diagnosis

I took a screenshot of a broken CSS layout — a nav bar collapsing
on mobile — and passed it directly to Gemma 4 9B with the question:

"What is wrong with this layout and what CSS would fix it?"

It identified the missing flex-wrap and the hardcoded pixel width
on the nav items without seeing a single line of my code. It saw the
rendered output and reasoned backwards to the cause.

That is a workflow shift. Describing a visual bug in words is
imprecise. Showing it directly is not.

Fine-tuning for domain knowledge

The pt (pre-trained) variants are designed for specialisation. Using
QLoRA — Quantized Low-Rank Adaptation — you can teach Gemma 4 your
codebase's patterns, your documentation's tone, your domain's
vocabulary, without retraining the entire model:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-4-9b-pt"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~42M — roughly 0.5% of total model weights

You are not retraining from scratch. You are teaching it a dialect.
That is efficient, practical, and something any developer with a decent
GPU can do today.

Where it genuinely struggles — the honest section

I would be wasting your time if I only praised it.

The 27B model demands real hardware. On a machine without a capable
dedicated GPU, inference is slow enough to break the conversational
flow that makes local AI useful. If you only have CPU, start with 2B.

Context quality degrades at the edges. The 128K window is real, but
coherence towards the far end of a very long context is not equal to
coherence in the middle. For massive codebases, chunking strategically
still produces better results than naive full-context dumps.

Multi-step mathematical reasoning has limits. Complex proof chains
or deeply nested logical puzzles — the 9B model makes confident errors.
The 27B is significantly better but there is still a gap versus
frontier closed models on the hardest reasoning tasks.

Knowing these limits is not a criticism. It is how you use the right
tool correctly.

The thing the spec sheet will not tell you

I want to say something that gets lost in every model comparison.

When I ran Gemma 4 locally, I shared my actual database schema with
it. My actual API architecture. Conversations about real design
decisions in a real product with real users.

With cloud AI, every one of those prompts travels somewhere. Gets
logged somewhere. Possibly influences something somewhere downstream.

With Gemma 4, that conversation stays on the machine it runs on.

For independent developers. For students building things that matter
to them. For engineers at companies with data governance requirements.
For anyone working on something they are not ready to share with the
world yet.

Owning your inference is not a secondary feature. For a large class
of real use cases, it is the only feature that matters.

The decision that took me ten minutes to make

Do you need edge / mobile deployment?
└─ Yes → gemma-4-it-2b
Do you have a standard developer laptop (16GB RAM)?
└─ Yes → gemma-4-it-9b ← most people should start here
Do you have a workstation GPU?
└─ Yes → gemma-4-it-27b
Do you need the model to know your specific domain deeply?
└─ Yes → gemma-4-pt-[size] + QLoRA fine-tuning

Do not spend three days on this decision. Pull 9B tonight. You will
know within one conversation whether it fits your use case.

What it means that open models are this capable now

I have been writing software long enough to remember when "run AI
locally" was a novelty with no practical use.

Gemma 4 is not that.

It is a model that a solo developer — no enterprise contract, no
research budget, no special access — can run, query with images,
reason over an entire codebase, fine-tune on private data, and deploy
inside a product. Completely independently. Right now.

The frontier closed models will always be ahead on the hardest
benchmarks. That argument is not interesting anymore. The interesting
question is what happens when capable, private, fast AI inference
becomes something any developer can own.

The answer is being written by people doing exactly what you are doing
right now — reading about it, pulling a model, building something.

Gemma 4 is not trying to beat the biggest closed model. It is trying
to be the model that a million developers actually use, understand,
and make their own.

Looking at where it is today, I think it is already winning that race.

Start tonight

ollama pull gemma4:9b
ollama run gemma4:9b

Forem: Rakesh Mondal