Forem: Hashevolution

Gemma 4 suddenly stopped answering: how outside collaboration found the root cause within 24 hours

Hashevolution — Fri, 22 May 2026 08:51:00 +0000

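TL;DR (one paragraph)
On 2026-05-18, JAMES (PROJECT JAMES, a local Graph-RAG engine) published a fair-witness report describing a pattern in which four of its cognitive stages deterministically return empty responses on gemma4:e4b. A few days later, Ali Afana (founder of Provia) cited JAMES as a third cross-validation context in his own Gemma 4 walk-back article. A single-variable experiment on the JAMES side then pinned down the mechanism: the model consumes roughly 500 tokens of hidden reasoning before the first visible output token. If the cap is below this floor, the response is empty 100% of the time. A four-line code change (PR #399) has been merged. The external-to-internal cross-validation, the quantification of the mechanism, and the production fix all happened within 24 hours.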

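How it started: "Why is the response empty?"

JAMES (PROJECT JAMES) is a Graph-RAG system that runs locally. It uses the efficient variant of Gemma 4 (gemma4:e4b, 4B parameters) on Ollama as its default model. When the v0.3 cognitive middleware layer shipped phase 2 and added stages such as the query rewriter, planner, reflect, verify, and fact-check, an odd pattern appeared.

Same code, same model, yet some stages worked fine and others always returned 0 characters. Switching to gemma3:12b (12B) passed 9/9. Changing the model fixed the symptom, but the reason remained a mystery.

On 2026-05-18 we published a fair-witness report on dev.to that laid out four hypotheses without committing to an answer:

- Hypothesis A: a meta-reasoning capacity floor in the 4B model
- Hypothesis B: an early stop token on short structured prompts
- Hypothesis C: confusion from mixing Korean instructions with English JSON keys
- Hypothesis D: a prompt-truncation bug on the JAMES side

Outside data was welcome: whether an external reader falsified or confirmed a hypothesis, the result would be added to the report.
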
Ali Afana's walk-back: the same pattern in a different deployment context

A few days later, Ali Afana published his own analysis on dev.to of two other Gemma 4 variants (31B Dense vs 26B MoE). His initial claim: "The behavioral difference between the two architectures is caused by the architecture."

Robin Converse (Triava Labs) then ran a simple check in her own sovereign Ollama environment: she lifted the max_tokens cap, reran the same scenarios, and all 18/18 passed. Her question to Ali: "What happens if you lift the cap on the managed Gemini side?"

Ali reran the experiment with a single variable changed, max_tokens 400 → 4096. Dense 12/12, MoE 12/12: everything recovered. As a result, Ali publicly walked back his article: "The difference was the token cap, not the architecture."

In the walk-back article he names JAMES's production defaults as a third cross-validation context:

| Source | Context | Test | Result |
|---|---|---|---|
| Robin Converse | sovereign Ollama, uncapped | 6 scenarios × 3 temperatures | 18/18 |
| Ali Afana | managed Gemini, 400 → 4096 | 12 calls | 12/12 recovered |
| JAMES | local Ollama, default 200/400/400/400 | 5/6 stages | empty response |

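Validation on the JAMES side: the V3' single-variable experiment

Checking the JAMES code turned up a striking match:

| Stage | JAMES default | Ali's failing cap |
|---|---|---|
| query_rewriter.py:46 | 200 | 400 |
| planner.py:43 | 400 | 400 ← exact match |
| reflect.py:54 (CRITIQUE) | 400 | 400 ← exact match |
| verify.py:69 (FACT_CHECK) | 400 | 400 ← exact match |

JAMES's production default was exactly Ali's failing threshold, a match we had not been aware of. The next step was to verify it with a single variable. We named the in-house experiment V3'.

V3'.a: query_rewriter stage (n=10 per cap)

[code listing]

What this means: gemma4:e4b consumes roughly 500 tokens of hidden reasoning before the first visible output token. If the cap is below this floor, the model cannot emit a single visible byte, and the response is empty 100% of the time, deterministically. This is a token-level, quantitative measurement of the pattern Ali described in his article as "starving the reasoning layer."

V3'.b: planner stage (n=10 per cap)

[code listing]

Cross-stage diagnosis:

| Metric | V3'.a (cap 200) | V3'.b (cap 400) | Interpretation |
|---|---|---|---|
| Default-cap latency | 2.1s | 4.3s | Cap doubles, latency roughly doubles |
| 4096-cap latency | 5.3s | 7.1s | +1.8s for the planner's additional reasoning |

Latency is 2.1s at cap 200 and 4.3s at cap 400, which is linear in the cap. In other words, the ~500-token reasoning floor looks like a model-level property that does not depend on the stage. The model burns time in proportion to the cap and then terminates without emitting a single visible byte.

Hypothesis space, updated:

- ✅ B (token budget): confirmed, with the mechanism measured
- ❌ A (4B floor): effectively rejected, since the same model works once the cap is lifted
- 🤷 C, D: unchanged (not tested)
- ⏸ E (-tag post-processing): weakened by the cross-stage consistency
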
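
The single-variable sweep described above can be sketched roughly as follows. This is a minimal illustration, not the JAMES harness: `sweep_caps` and `fake_model` are hypothetical names, and `fake_model` is a stand-in that simulates a fixed 500-token hidden-reasoning floor (taken from the ~500 figure above) rather than calling a real model.

```python
from typing import Callable, Dict, List

HIDDEN_FLOOR = 500  # assumed floor, from the ~500-token figure in this article


def fake_model(prompt: str, max_tokens: int) -> str:
    """Stand-in model: spends HIDDEN_FLOOR tokens before any visible output."""
    visible_budget = max_tokens - HIDDEN_FLOOR
    if visible_budget <= 0:
        return ""  # the cap ran out during hidden reasoning
    return "ok " * min(visible_budget, 20)


def sweep_caps(
    generate: Callable[[str, int], str],
    prompt: str,
    caps: List[int],
    trials: int = 10,
) -> Dict[int, float]:
    """Return the empty-response rate per cap; the cap is the only variable."""
    rates: Dict[int, float] = {}
    for cap in caps:
        empties = sum(1 for _ in range(trials) if not generate(prompt, cap).strip())
        rates[cap] = empties / trials
    return rates


if __name__ == "__main__":
    results = sweep_caps(fake_model, "rewrite this query", [200, 400, 600, 4096])
    for cap, rate in results.items():
        print(f"cap={cap:>5}  empty_rate={rate:.0%}")
```

To run this against a real backend, `generate` would be replaced with a function that sends the prompt with the given cap and returns the visible text; everything else stays fixed, which is what makes the comparison single-variable.
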
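The fix: a four-line code change

PR #399 bumps the DEFAULT_MAX_TOKENS constant of the four stages to 4096.

[diff]

Each change carries a stable-WHY comment, so that a future maintainer can understand "why 4096" from the code alone.

STEP 7 bench (a 13-query regression test, run on the recommended gemma3:12b model): 13/13 within baseline tolerance, confirming that the change is non-breaking. The PR has been squash-merged.
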
What's next

Immediate (1-3 days)

- Post-merge validation with V3'.c (reflect.critique) and V3'.d (verify.fact_check). Same protocol, same model, same default of 400, so there is a strong prior that the same pattern reproduces. Any unexpected drift is a single-line revert.
- The inline max_tokens=300 in synth.web_summary (core/reasoning/pipeline_synth.py:141) is also below the ~500 floor; it will be fixed in a separate PR.

Mid-term (mid-June)

- Ali Afana's Gemini backend implementation PR is expected to arrive, on the surface already set up by the Track 1 Provider contract.
- Track 3 STEP 7 cross-experiment: a swap eval between JAMES (Ollama local) and Ali (Gemini API), comparing the retrieval-conditioning and synthesis layers of the two backends on the same wiki corpus.

Long-term (joint piece, Track 5)

- Publish Robin Converse's temperature sweep post, the JAMES cross-experiment, and Ali's Gemini results as a three-name joint piece.
- Working title: "3 contexts, 2 architectures, 1 mechanism", a collaboration case in which the same mechanism was observed in three operating environments.

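
As a rough illustration of the check implied above, the per-stage caps named in this article can be audited against the measured floor. The cap values come from the text; the dictionary layout and the `stages_below_floor` helper are hypothetical and do not reflect how JAMES stores its defaults.

```python
REASONING_FLOOR = 500  # approximate floor reported in this article, not an exact constant

# Per-stage caps as listed in this article (before PR #399).
STAGE_CAPS = {
    "query_rewriter": 200,
    "planner": 400,
    "reflect.critique": 400,
    "verify.fact_check": 400,
    "synth.web_summary": 300,
}


def stages_below_floor(caps: dict, floor: int = REASONING_FLOOR) -> list:
    """Return the stages whose cap leaves no room for visible output."""
    return sorted(name for name, cap in caps.items() if cap <= floor)


if __name__ == "__main__":
    print(stages_below_floor(STAGE_CAPS))
    # After bumping one stage to 4096 it drops out of the list.
    print(stages_below_floor({**STAGE_CAPS, "planner": 4096}))
```
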
What this story taught us

Technical

- An "empty response" from an LLM is often a budget shortfall rather than a capability limit. If the cap is below the hidden reasoning floor, the cap is reached before any visible output.
- The hidden-to-visible token ratio of gemma4:e4b is about 5-6:1 (measured on a single stage). This appears to be a model-level property.
- Setting a cap without measuring the reasoning budget a prompt requires is a trap that leads to deterministic failure. Operational defaults should be set per model, after measuring the floor.

Collaboration

- The greatest value of collaboration is an outside person honestly walking back their own hypothesis. Because Ali publicly corrected his article, three people's deployment contexts completed a cross-validation within 24 hours.
- The value of a public fair-witness report: outsiders cite it as a reference, and a chain of research forms. That is not possible with closed research notes.
- The power of the single-variable test. The way Ali tested his hypothesis and the way we tested it in JAMES both changed one variable and held everything else fixed. That is essential for isolating a mechanism.

Links

- JAMES fair-witness report (2026-05-18): https://dev.to/hashevolution/5-empty-responses-from-gemma4e4b-4-hypotheses-0-root-cause-1ggd
- Ali Afana walk-back article: https://dev.to/alimafana/i-raised-gemma-4s-token-cap-the-dense-model-stopped-refusing-2gf3
- JAMES GitHub repo: https://github.com/Hashevolution/James-RAG-Evol
- PR #399 (cap fix, merged): https://github.com/Hashevolution/James-RAG-Evol/pull/399

5 empty responses from gemma4:e4b. 4 hypotheses. 0 root cause.

Hashevolution — Mon, 18 May 2026 08:49:44 +0000

dev.to — Gemma 4 Challenge submission (Write track)

Drafted: 2026-05-18
Track: Write about Gemma 4 ($100 × 5 winners)
Source material: gemma4-e4b-cognitive-stages-eval.md — internal fair-witness report (PR #307)
Companion submission: Build track piece on E4B model choice
Submission deadline: 2026-05-24 23:59 PDT
Winners announced: 2026-06-04
Tags: devchallenge, gemmachallenge, gemma, ollama

## Why a Write-track submission in addition to the Build-track one

The Build-track submission (Building a Mini Palantir on gemma4:e4b — 128K Context Lets the Graph Actually Be Graph-RAG) made an intentional model choice argument: 128K context window > parameter count, so E4B was right for the Graph-RAG retrieval-conditioning stage.

That argument held — and it is still the strongest single thing E4B does in this project.

But once the v0.3 Cognitive Middleware Layer started shipping Phase 2 stages (verification, planner, tool router, query rewriter, fact-check), a second pattern showed up that the Build-track piece could not honestly absorb: E4B silently returns empty responses on five of the nine cognitive stages, while the same prompts on Gemma 3 12B succeed end-to-end.

This is the Write-track piece that documents that pattern honestly, without retreating from the Build-track claim. Same author, two articles, two facets of the same model.

The challenge judging rubric for the Write track is:

- Clarity and depth of explanation
- Originality of perspective or insight
- Practical value to the community
- Quality of writing

A fair-witness field report meets all four at once: it shares reproducible numbers, an explicit "I don't know yet" stance on root cause, and a set of open questions that other operators can act on.

## Suggested title (pick one)

| # | Title | Why |
|---|---|---|
| **A** | **5 empty responses from `gemma4:e4b`. 4 hypotheses. 0 root cause.** A fair-witness field report from a Graph-RAG production. | ⭐ Strongest hook: number-led, names a tension (no resolution), promises honesty. |
| B | Where Gemma 4 e4b runs out of room: empty responses on meta-reasoning stages | Clearer technical framing, slightly less click-worthy |
| C | Gemma 4 e4b: brilliant at synthesis, silent on meta-reasoning. A field report. | Bridges the strengths and weaknesses in the title itself |

Recommended: A. Numbers-led titles outperform on the dev.to feed; the "0 root cause" half signals the writing is honest rather than gloating.

## Cover image

Use reports/promo-assets/screenshots/03-chat-graph-paths.jpg — the chat-UI screenshot with "그래프 경로 47개 보기" surfaced. It primes the reader for "this writer ships a real Graph-RAG pipeline" before the article gets into the failure mode.

Download URL: https://github.com/Hashevolution/James-RAG-Evol/blob/main/reports/promo-assets/screenshots/03-chat-graph-paths.jpg?raw=true

# Submission body (copy-paste into dev.to editor)

*This is a submission for the [Gemma 4 Challenge: Write about Gemma 4](https://dev.to/challenges/google-gemma-2026-05-06)*

## TL;DR

`gemma4:e4b` (4 B parameters, the "efficient" Gemma 4 build) **excels at long-form natural-language synthesis from a 5 KB retrieved context** in my Graph-RAG project. It also **silently returns empty responses on five short meta-reasoning stages** — query rewrite, plan decomposition, web summary, self-critique, fact-check. Same model, same backend, same `task` parameter. Swapping to `gemma3:12b` made all five succeed without touching a single prompt.

I have data. I do not yet have a root cause. Posting this as a fair-witness field report in case other local-LLM operators have seen the same pattern (or have a prompt-side fix that doesn't require jumping to a 12 B model).

This is a companion to my earlier [Build-track submission](https://dev.to/hashevolution/building-a-mini-palantir-on-gemma4e4b-128k-context-lets-the-graph-actually-be-graph-rag-33fk), which argued for E4B on the basis of its 128 K context window. That argument is still right — for the synthesis stage. The five meta stages are where the 4 B variant runs out of room.

## Setup (reproducible)

### Project

[**PROJECT JAMES v0.3.x**](https://github.com/Hashevolution/James-RAG-Evol) — a local-first Graph-RAG reasoning engine. MIT-licensed, Ollama-only, no cloud LLM dependency. v0.3.0 shipped the Cognitive Middleware Layer architecture; v0.3.x is landing its phases incrementally — verification engine, planner, tool router, query rewriter, fact-check.

Relevant stages of the cognitive layer:

| Stage | Purpose | Prompt shape |
|---|---|---|
| `query_rewrite` | Rewrite the user question for retrieval | Korean/English instruction → JSON `{"rewritten": "..."}` |
| `plan.decompose` | Break a multi-aspect question into ≤ 5 subtasks | Instruction → JSON `{"subtasks": [...]}` |
| `synth.rag` | The actual long-form answer | System prompt + retrieved context (~5 KB) + Korean question → Korean prose answer |
| `synth.web_summary` | Summarize fetched web results | Instruction + web snippets → short Korean summary |
| `reflect.critique` | Critique the draft answer | Draft + instruction → Korean critique text |
| `verify.fact_check` | Audit claims against source docs | Answer + sources + instruction → JSON `{"grounded": bool, "unsupported": [...]}` |

All stages route through one Ollama backend adapter and use the same `JAMES_LLM_MODEL` env var. Whatever model is named, every stage talks to it the same way.

### Environment

- OS: Windows 11, PowerShell
- Ollama: latest mid-May 2026 build
- Models installed locally: `gemma4:e4b` (9.6 GB, ~4 B params), `gemma3:12b` (8.1 GB, ~12 B), plus a few others irrelevant to this report
- All `JAMES_ENABLE_*` cognitive flags set to `1` in the same shell before launching the server

### Test query

```
BlackRock 과 Vanguard 의 ETF 전략 차이를 비교해줘
```

A real Korean retrieval question. Intent classifier picked `retrieval` correctly. Document corpus contains ~10 finance documents matching the topic.

## What I observed with `gemma4:e4b`

Direct quote of the server console (one query, all stages enabled):

| Stage | LLM call type | Latency | Response size | Result |
|---|---|---|---|---|
| INTENT classify | `task=classify` | 9.1 s | **9 chars** ("retrieval") | ✅ OK |
| `query_rewrite` | `task=general` | 2.1 s | **0 chars** | ❌ empty |
| entity extract | `task=extract` | 9.5 s | **452 chars** (JSON of 9 entities) | ✅ OK |
| `synth.web_summary` | `task=general` | 4.0 s | **0 chars** | ❌ empty |
| `synth.rag` | `task=general` | 13.7 s | **2 690 chars** (Korean prose) | ✅ OK |
| `reflect.critique` | `task=general` | 4.2 s | **0 chars** | ❌ empty |
| `verify.fact_check` | `task=general` | 4.3 s | **0 chars** (prompt 4 319 → truncated to 4 000) | ❌ empty |

The empty-response path is taken when Ollama returns HTTP 200 but `response: ""` — the server replied successfully, the model just produced zero tokens. JAMES logs it as `gemma.empty_response`.
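
An HTTP 200 with an empty `response` can have more than one cause, so it may help to log why generation ended. The sketch below is illustrative only: `classify_generate_payload` is a hypothetical helper (not the JAMES logger), and it assumes the payload carries the `response`, `done_reason`, and `eval_count` fields described in Ollama's `/api/generate` documentation. Field availability may vary by Ollama version, so treat it as an assumption to verify.

```python
def classify_generate_payload(payload: dict) -> str:
    """Label a non-streaming /api/generate payload by how generation ended."""
    text = (payload.get("response") or "").strip()
    if text:
        return "ok"
    reason = payload.get("done_reason")
    generated = payload.get("eval_count", 0) or 0
    if reason == "length":
        # The token limit was reached without any visible text.
        return "empty_hit_cap"
    if generated == 0:
        # Nothing was generated at all (for example an immediate stop).
        return "empty_no_tokens"
    return "empty_other"


if __name__ == "__main__":
    sample = {"response": "", "done": True, "done_reason": "length", "eval_count": 400}
    print(classify_generate_payload(sample))
```

Logging this label next to `gemma.empty_response` would separate "the model stopped immediately" from "the model ran out of budget", which are different failure modes with different fixes.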

### What's striking

- **The 5 empty responses cluster at ~2–4 seconds.** Not a timeout. The per-stage budget is 10–30 s; the model decided it was done.
- **The two successful generation calls** (entity extract: JSON, logged as `task=extract` in the table above; synth.rag: long Korean prose, `task=general`) **took 9.5 s and 13.7 s.** Same backend and same model as the failing calls; the main difference is the prompt shape.
- **The pattern is consistent across multiple trials.** Run the same query three times back-to-back and the same five stages are empty each time.

## Control — same prompts on `gemma3:12b`

Same query, same flags, no other changes. Single env-var swap, restart server:

| Stage | Latency | Response | Result |
|---|---|---|---|
| `query_rewrite` | 0.91 s | "BlackRock 및 Vanguard의 ETF 투자 전략과 포트폴리오 구성 방식의 차이점을 비교 분석해줘" — meaning-preserved keyword expansion | ✅ |
| `plan.decompose` | 1.33 s | 3 subtasks (BlackRock 조사 / Vanguard 조사 / 비교 분석) | ✅ |
| `synth.rag` | 9.6 s | 2 690-char Korean answer | ✅ |
| `reflect.critique` | 7.98 s | "## 답변 초안 비판적 검토 — 모순 / 사실 오류 …" — coherent meta-critique | ✅ |
| `reflect.revised` | 9.19 s | revised answer based on critique | ✅ |
| `verify.fact_check` | 1.17 s | `{"grounded": true, "unsupported": []}` — valid JSON | ✅ |

Full 9-step trace renders end-to-end. Wall-clock ~39 s. Same prompts. Same wiring. Same backend.

**This is the punchline: nothing changed except the model name.**

## Where Gemma 4 e4b still wins

Staying fair to the model:

- Long-form synthesis from a 5 KB retrieved context — the project's most-frequent stage — handled well at 13.7 s for 2 690 chars of genuinely useful Korean prose.
- JSON entity extraction with a 9-entity schema returned 452 chars of clean JSON at 9.5 s.
- Single-token classification — emit exactly one of seven mode strings — was fine.

The model is not "broken." It ships real Graph-RAG answers. The narrow failure mode is a second class of prompts: **short, structured, meta-instructional**.

## The failure pattern

```
✅ succeeds    long context + free-form Korean prose
✅ succeeds    short instruction + emit 1 token from a finite vocab
✅ succeeds    rich context + emit one JSON object describing the input
❌ empty       short context + emit JSON that critiques / restructures / audits the input
```

The five empty responses share three traits:

1. **The model is asked to act on a model output** — rewrite the user query, critique a draft, audit claims.
2. **The expected output is short and structured** — a few sentences, or a tight JSON object.
3. **The prompt mixes Korean instructions with English JSON schema keys** — e.g. `{"rewritten": "..."}` or `{"grounded": true, "unsupported": []}`.

A natural-language paraphrase (synth.rag) avoids all three. A JSON entity extraction has trait 3 only, and that one passes. The cluster of all three is what seems to silence the model.

## Four working hypotheses

I have data but not a root cause. Four candidate explanations, listed by my own subjective likelihood:

### A. Meta-reasoning capacity at 4 B is the floor

Critique / verify / decomposition prompts ask the model to reason *about* another reasoning artifact. My reading of reports on small open-weights models (Qwen 2.5-3B, Phi-3-mini, Gemma-2-2B, …) is that meta-reasoning is among the first capabilities to degrade below roughly 7 B params, while paraphrase-from-context survives at much smaller sizes. If this is right, no prompt-side fix exists for E4B on these stages.

### B. Early stop-token emission on short structured prompts

Ollama returning `response: ""` on a 2–4 s call (well below the timeout) is consistent with the model emitting EOS / `<end_of_turn>` immediately. Possibly the chat template wrapping resembles a completed conversation when the user prompt itself looks like an instruction-only frame with no input data attached.

### C. Korean instruction + English JSON schema confusion

The five failing prompts all mix Korean directive language with English-key JSON output. The two succeeding `task=general` calls don't (entity extract uses Korean prompt → Korean-content JSON; synth.rag is all Korean). Worth testing whether an all-Korean schema (e.g. `{"재작성된_질의": "..."}`) would change anything.
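
The test proposed for Hypothesis C can be set up as a pair of prompts that differ only in the JSON key. This is a sketch under assumptions: the Korean instruction wording and the helper names are hypothetical, not the actual JAMES `query_rewrite` prompt.

```python
import json

# The only thing that differs between variants is the JSON key.
VARIANTS = {"english_key": "rewritten", "korean_key": "재작성된_질의"}


def build_rewrite_prompt(question: str, key: str) -> str:
    """Build a query-rewrite prompt whose JSON example uses the given key."""
    schema_example = json.dumps({key: "..."}, ensure_ascii=False)
    return (
        "다음 질문을 검색에 적합하게 재작성하세요.\n"  # hypothetical instruction text
        f"질문: {question}\n"
        f"JSON 으로만 답하세요: {schema_example}"
    )


def build_variants(question: str) -> dict:
    """Return one prompt per variant, identical except for the schema key."""
    return {name: build_rewrite_prompt(question, key) for name, key in VARIANTS.items()}


if __name__ == "__main__":
    for name, prompt in build_variants("BlackRock 과 Vanguard 의 ETF 전략 차이를 비교해줘").items():
        print(f"--- {name} ---\n{prompt}\n")
```

Sending both variants with identical generation options keeps the key language as the single variable.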

### D. JAMES-side prompt-truncation artifact

The `verify.fact_check` log shows `prompt 4 319자 → 4 000자 축약` — JAMES capped the prompt at 4 000 chars, which likely chopped the closing brace of an embedded JSON example in the system prompt. If true, this is a JAMES bug, not a Gemma 4 bug — but it would only explain `verify.fact_check`, not the other four empty responses.
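
One way to rule Hypothesis D in or out is to check whether the truncated prompt still has balanced braces, and to trim only the variable part of the prompt so the instruction survives. The sketch below is illustrative: `braces_balanced` and `fit_prompt` are hypothetical helpers, and placing the instruction after the sources is an assumption about the prompt layout rather than JAMES's actual behavior.

```python
def braces_balanced(text: str) -> bool:
    """Return True if every '{' in text is closed by a later '}'."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def fit_prompt(instruction: str, sources: str, limit: int = 4000) -> str:
    """Trim only `sources` so the instruction and its JSON example stay intact."""
    separator = "\n\n"
    room = limit - len(instruction) - len(separator)
    if room < 0:
        raise ValueError("instruction alone exceeds the prompt limit")
    return sources[:room] + separator + instruction


if __name__ == "__main__":
    instruction = 'Answer as JSON: {"grounded": true, "unsupported": []}'
    sources = "x" * 4319
    prompt = fit_prompt(instruction, sources)
    print(len(prompt), braces_balanced(prompt))
```
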

The report explicitly **does not** advocate for a single hypothesis. Narrowing them down is the work this feedback round asks the community to help with.

## What I'd love feedback on

If you've used `gemma4:e4b` (or `gemma4:e2b`) and have data points either way, I'd like to know:

1. Have you seen the same "empty response on short structured prompts" pattern? Especially critique-of-a-draft, JSON schema audit, query rewrite.
2. Did a prompt-engineering change rescue it on your setup? Different chat template, different `num_predict`, different temperature, all-one-language prompts, anything else.
3. Does `gemma4:e2b` show the same pattern, or is it specific to E4B?
4. Does the same prompt set behave on `gemma4:31b-dense` / `gemma4:26b-moe` if you have one of those provisioned?
5. Is there a known issue with Ollama + Gemma 4 + JSON-output prompts in your experience?
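
For anyone trying the `num_predict` and temperature variations from question 2, a small grid of request bodies keeps the comparison systematic. This sketch only builds the JSON bodies for Ollama's `/api/generate` endpoint (served at `http://localhost:11434` by default) and does not send them. `build_requests` is a hypothetical helper, and the grid values are arbitrary examples rather than recommendations.

```python
from itertools import product
from typing import List


def build_requests(
    model: str,
    prompt: str,
    num_predicts: List[int],
    temperatures: List[float],
) -> List[dict]:
    """Build one non-streaming /api/generate body per (num_predict, temperature) pair."""
    return [
        {
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_predict": n, "temperature": t},
        }
        for n, t in product(num_predicts, temperatures)
    ]


if __name__ == "__main__":
    # Example grid values are illustrative only.
    for body in build_requests("gemma4:e4b", "same prompt every time", [256, 1024, 4096], [0.0, 0.7]):
        print(body["options"])
```

Holding the prompt and model fixed while varying one option at a time makes any change in the empty-response rate attributable to that option.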

Project's stance on next steps:

- Default model swap to `gemma3:12b` is already done locally. `gemma4:e4b` stays available — its long-context synthesis is the project's bread-and-butter stage.
- A follow-up PR (option A2) will let operators wire individual cognitive stages to different backends, so E4B can keep `synth.rag` while a heavier model takes the meta stages.
- We will **not** patch JAMES's prompt shapes specifically to coax E4B into responding on these stages until we understand whether the empty response is the model declining, the chat template misfiring, or a JAMES-side truncation bug.
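
The per-stage routing idea (option A2) could look roughly like the sketch below. It is an assumption-laden illustration: the `JAMES_LLM_MODEL__<STAGE>` override naming and both helper functions are hypothetical and not part of JAMES today; only `JAMES_LLM_MODEL` and the stage names come from this report.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL_ENV = "JAMES_LLM_MODEL"
FALLBACK_MODEL = "gemma3:12b"  # assumed fallback, matching the local default swap


def stage_env_name(stage: str) -> str:
    """Map a stage name such as 'synth.rag' to a hypothetical override variable."""
    return "JAMES_LLM_MODEL__" + stage.upper().replace(".", "_")


def model_for_stage(stage: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the per-stage override if set, else the global model, else the fallback."""
    env = os.environ if env is None else env
    return env.get(stage_env_name(stage)) or env.get(DEFAULT_MODEL_ENV, FALLBACK_MODEL)


if __name__ == "__main__":
    demo_env = {
        "JAMES_LLM_MODEL": "gemma3:12b",
        "JAMES_LLM_MODEL__SYNTH_RAG": "gemma4:e4b",
    }
    for stage in ("synth.rag", "reflect.critique", "verify.fact_check"):
        print(stage, "->", model_for_stage(stage, demo_env))
```

With a mapping like this, E4B could keep `synth.rag` while the meta stages fall through to the heavier default.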

## Reproduction

If you want to reproduce — or, more usefully, to falsify — the report on your own corpus:

```powershell
# 1. Install JAMES (MIT-licensed, no cloud dependency)
git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
python -m pip install -r requirements.txt

# 2. Make sure the two models are local
ollama pull gemma4:e4b
ollama pull gemma3:12b

# 3. Enable the five cognitive stages
$env:JAMES_ENABLE_QUERY_REWRITE = "1"
$env:JAMES_ENABLE_PLANNER       = "1"
$env:JAMES_ENABLE_REFLECT       = "1"
$env:JAMES_ENABLE_VERIFY        = "1"
$env:JAMES_ENABLE_FACT_CHECK    = "1"

# 4. Test with Gemma 4 (the server occupies this shell until it is stopped)
$env:JAMES_LLM_MODEL = "gemma4:e4b"
python server_llmwiki.py
# In a second shell, send a retrieval query (e.g. the BlackRock vs Vanguard line above),
# then inspect the most recent trace:
python scripts/replay_trace.py --recent

# 5. Control: stop the server (Ctrl+C), switch to Gemma 3, and restart
$env:JAMES_LLM_MODEL = "gemma3:12b"
python server_llmwiki.py
# Same query, same trace command: all 9 stages succeed
```

If you publish your own numbers — X / GitHub issue / Reddit / dev.to comment — please tag `#JAMES` or open an issue on the [repo](https://github.com/Hashevolution/James-RAG-Evol). I'll link it back to this report.

## A note on the companion piece

This Write-track submission and the [Build-track submission](https://dev.to/hashevolution/building-a-mini-palantir-on-gemma4e4b-128k-context-lets-the-graph-actually-be-graph-rag-33fk) are deliberately contradictory in tone: one defends the model choice, the other documents where the same model falls short on a different class of prompts. Both are honest readings of the same model under different conditions. I think the contradiction is the point: writing about Gemma 4 that is useful to the community has to include both halves, not just the half that fits the marketing arc.

If you've read [Ali Afana's parallel piece on MoE vs Dense](https://dev.to/alimafana/i-added-three-rules-to-gemma-4-the-moe-searched-the-dense-model-refused-1j18), you'll recognize the framing: same prompt, opposite behavior, architecture under the model is the variable I wasn't controlling. He came at it from MoE vs Dense; I came at it from 4 B vs 12 B and meta-task vs synthesis-task. The two reports compose.

---

🤖 *Honest disclosure: this submission was drafted with AI assistance and edited by the author. The trace numbers, environment specs, and reproduction commands are real and verifiable in the linked repository. The hypotheses are the author's; the fair-witness framing — data without root cause — is deliberate.*

---

## Where to publish

dev.to → New Post → Editor v1 (markdown) → paste the body above → set title, tags, cover image → **Publish**.

After publish:

1. Add the URL to the "Social posts" table in `reports/promo-assets/launch-tracker.md` (or open a small docs PR).
2. Add a self-reply comment on the article pointing at:
   - The internal eval report ([`gemma4-e4b-cognitive-stages-eval.md`](./gemma4-e4b-cognitive-stages-eval.md)) — the source of truth.
   - The Build-track submission — completes the "two halves, same author" arc.
   - Ali Afana's parallel piece — extends the conversation across two writers.
3. Quote-reply from the existing X English thread + LinkedIn post linking the new article. Image: `06-3d-graph.jpg` again (the hero), or `03-chat-graph-paths.jpg` if the post wants to lead with the chat UI.

## Why this submission can win the Write track

| Rubric criterion | This piece |
|---|---|
| **Clarity and depth of explanation** | One controlled experiment, six tabulated trace rows, four named hypotheses, explicit reproduction script |
| **Originality of perspective or insight** | Fair-witness framing — "I have data, not a conclusion" — is rare in dev.to LLM writing. Most pieces commit to a hypothesis early |
| **Practical value to the community** | The five open questions are answerable by anyone running Gemma 4 + Ollama. Any single reply with falsifying data is useful project-wide |
| **Quality of writing** | Inherited from the eval report's voice — short paragraphs, tight tables, no flourish |

Combined with the Build-track piece, the same author appears twice on the challenge with two non-overlapping perspectives on the same model. That itself is a signal of seriousness — defending a model in one piece and documenting its limits in the other is the opposite of a marketing arc.

## Risk-management notes

- The piece is honest about a failure mode of Gemma 4. It is *not* a hit piece — it explicitly preserves credit for what the model does well, and frames the failure as "rich call for community data" rather than "model is bad." This tone is the actual differentiator.
- The mention of Korean text in failed prompts could be misread as a language-equity issue. The body explicitly frames Hypothesis C as one of four possibilities and proposes a test (Korean-key JSON). That is the right scope for the claim, and no broader.
- Title A is number-led (5, 4, 0). If dev.to's automatic linting flags it, B or C are safe fallbacks.

## Companion artifacts

- Source eval report (definitive numbers): [`gemma4-e4b-cognitive-stages-eval.md`](./gemma4-e4b-cognitive-stages-eval.md)
- Feedback-routing handover (what to do when replies arrive): [`docs/handovers/v0.3.x-gemma4-feedback-track.md`](../../docs/handovers/v0.3.x-gemma4-feedback-track.md)
- Build-track Companion: [`devto-gemma4-challenge.md`](./devto-gemma4-challenge.md)
- Visual library for cover / inline images: [`screenshots/README.md`](./screenshots/README.md)
- Launch tracker (running log): [`launch-tracker.md`](./launch-tracker.md)

## Tags

```plaintext
devchallenge, gemmachallenge, gemma, ollama
```

`ollama` is the 4th tag (instead of `rag`) — the failure mode plausibly involves the Ollama chat template or stop-token handling, so the Ollama tag's audience is more likely to recognize the pattern.

---

# Submission body (copy-paste into dev.to editor)

`

```markdown
*This is a submission for the [Gemma 4 Challenge: Write about Gemma 4](https://dev.to/challenges/google-gemma-2026-05-06)*

## TL;DR

`gemma4:e4b` (4 B parameters, the "efficient" Gemma 4 build) **excels at long-form natural-language synthesis from a 5 KB retrieved context** in my Graph-RAG project. It also **silently returns empty responses on five short meta-reasoning stages** — query rewrite, plan decomposition, web summary, self-critique, fact-check. Same model, same backend, same `task` parameter. Swapping to `gemma3:12b` made all five succeed without touching a single prompt.

I have data. I do not yet have a root cause. Posting this as a fair-witness field report in case other local-LLM operators have seen the same pattern (or have a prompt-side fix that doesn't require jumping to a 12 B model).

This is a companion to my earlier [Build-track submission](https://dev.to/hashevolution/building-a-mini-palantir-on-gemma4e4b-128k-context-lets-the-graph-actually-be-graph-rag-33fk), which argued for E4B on the basis of its 128 K context window. That argument is still right — for the synthesis stage. The five meta stages are where the 4 B variant runs out of room.

## Setup (reproducible)

### Project

[**PROJECT JAMES v0.3.x**](https://github.com/Hashevolution/James-RAG-Evol) — a local-first Graph-RAG reasoning engine. MIT-licensed, Ollama-only, no cloud LLM dependency. v0.3.0 shipped the Cognitive Middleware Layer architecture; v0.3.x is landing its phases incrementally — verification engine, planner, tool router, query rewriter, fact-check.

Relevant stages of the cognitive layer:

| Stage | Purpose | Prompt shape |
|---|---|---|
| `query_rewrite` | Rewrite the user question for retrieval | Korean/English instruction → JSON `{"rewritten": "..."}` |
| `plan.decompose` | Break a multi-aspect question into ≤ 5 subtasks | Instruction → JSON `{"subtasks": [...]}` |
| `synth.rag` | The actual long-form answer | System prompt + retrieved context (~5 KB) + Korean question → Korean prose answer |
| `synth.web_summary` | Summarize fetched web results | Instruction + web snippets → short Korean summary |
| `reflect.critique` | Critique the draft answer | Draft + instruction → Korean critique text |
| `verify.fact_check` | Audit claims against source docs | Answer + sources + instruction → JSON `{"grounded": bool, "unsupported": [...]}` |

All stages route through one Ollama backend adapter and use the same `JAMES_LLM_MODEL` env var. Whatever model is named, every stage talks to it the same way.

### Environment

- OS: Windows 11, PowerShell
- Ollama: latest mid-May 2026 build
- Models installed locally: `gemma4:e4b` (9.6 GB, ~4 B params), `gemma3:12b` (8.1 GB, ~12 B), plus a few others irrelevant to this report
- All `JAMES_ENABLE_*` cognitive flags set to `1` in the same shell before launching the server

### Test query

```


BlackRock 과 Vanguard 의 ETF 전략 차이를 비교해줘


```

A real Korean retrieval question. Intent classifier picked `retrieval` correctly. Document corpus contains ~10 finance documents matching the topic.

## What I observed with `gemma4:e4b`

Direct quote of the server console (one query, all stages enabled):

| Stage | LLM call type | Latency | Response size | Result |
|---|---|---|---|---|
| INTENT classify | `task=classify` | 9.1 s | **9 chars** ("retrieval") | ✅ OK |
| `query_rewrite` | `task=general` | 2.1 s | **0 chars** | ❌ empty |
| entity extract | `task=extract` | 9.5 s | **452 chars** (JSON of 9 entities) | ✅ OK |
| `synth.web_summary` | `task=general` | 4.0 s | **0 chars** | ❌ empty |
| `synth.rag` | `task=general` | 13.7 s | **2 690 chars** (Korean prose) | ✅ OK |
| `reflect.critique` | `task=general` | 4.2 s | **0 chars** | ❌ empty |
| `verify.fact_check` | `task=general` | 4.3 s | **0 chars** (prompt 4 319 → truncated to 4 000) | ❌ empty |

The empty-response path is taken when Ollama returns HTTP 200 but `response: ""` — the server replied successfully, the model just produced zero tokens. JAMES logs it as `gemma.empty_response`.

### What's striking

- **The 5 empty responses cluster at ~2–4 seconds.** Not a timeout. The per-stage budget is 10–30 s; the model decided it was done.
- **The two successful `task=general` calls** (entity extract: JSON; synth.rag: long Korean prose) **took 9.5 s and 13.7 s.** Same backend, same model, same `task` parameter — only the prompt shape differs.
- **The pattern is consistent across multiple trials.** Run the same query three times back-to-back and the same five stages are empty each time.

## Control — same prompts on `gemma3:12b`

Same query, same flags, no other changes. Single env-var swap, restart server:

| Stage | Latency | Response | Result |
|---|---|---|---|
| `query_rewrite` | 0.91 s | "BlackRock 및 Vanguard의 ETF 투자 전략과 포트폴리오 구성 방식의 차이점을 비교 분석해줘" — meaning-preserved keyword expansion | ✅ |
| `plan.decompose` | 1.33 s | 3 subtasks: BlackRock 조사 / Vanguard 조사 / 비교 분석 (research BlackRock / research Vanguard / comparative analysis) | ✅ |
| `synth.rag` | 9.6 s | 2 690-char Korean answer | ✅ |
| `reflect.critique` | 7.98 s | "## 답변 초안 비판적 검토 — 모순 / 사실 오류 …" — coherent meta-critique | ✅ |
| `reflect.revised` | 9.19 s | revised answer based on critique | ✅ |
| `verify.fact_check` | 1.17 s | `{"grounded": true, "unsupported": []}` — valid JSON | ✅ |

Full 9-step trace renders end-to-end. Wall-clock ~39 s. Same prompts. Same wiring. Same backend.

**This is the punchline: nothing changed except the model name.**

## Where Gemma 4 e4b still wins

Staying fair to the model:

- Long-form synthesis from a 5 KB retrieved context — the project's most-frequent stage — handled well at 13.7 s for 2 690 chars of genuinely useful Korean prose.
- JSON entity extraction with a 9-entity schema returned 452 chars of clean JSON at 9.5 s.
- Single-token classification — emit exactly one of seven mode strings — was fine.

The model is not "broken." It ships real Graph-RAG answers. The narrow failure mode is a second class of prompts: **short, structured, meta-instructional**.

## The failure pattern

```
✅ succeeds    long context + free-form Korean prose
✅ succeeds    short instruction + emit 1 token from a finite vocab
✅ succeeds    rich context + emit one JSON object describing the input
❌ empty       short context + emit JSON that critiques / restructures / audits the input
```

The five empty responses share three traits:

1. **The model is asked to act on a model output** — rewrite the user query, critique a draft, audit claims.
2. **The expected output is short and structured** — a few sentences, or a tight JSON object.
3. **The prompt mixes Korean instructions with English JSON schema keys** — e.g. `{"rewritten": "..."}` or `{"grounded": true, "unsupported": []}`.

A natural-language paraphrase (synth.rag) avoids all three. A JSON entity extraction has trait 3 only, and that one passes. The cluster of all three is what seems to silence the model.

## Four working hypotheses

I have data but not a root cause. Four candidate explanations, listed by my own subjective likelihood:

### A. Meta-reasoning capacity at 4 B is the floor

Critique / verify / decomposition prompts ask the model to reason *about* another reasoning artifact. The empirical literature on small open-weights models (Qwen 2.5-3B, Phi-3-mini, Gemma-2-2B, …) consistently shows the meta-reasoning gap is the first capability to drop below ~7 B params, while paraphrase-from-context survives much smaller. If this is right, no prompt-side fix exists for E4B on these stages.

### B. Early stop-token emission on short structured prompts

Ollama returning `response: ""` on a 2–4 s call (well below the timeout) is consistent with the model emitting EOS / `<end_of_turn>` immediately. Possibly the chat template wrapping resembles a completed conversation when the user prompt itself looks like an instruction-only frame with no input data attached.

### C. Korean instruction + English JSON schema confusion

The five failing prompts all mix Korean directive language with English-key JSON output. The two succeeding long-output calls don't (entity extract uses a Korean prompt → Korean-content JSON; synth.rag is all Korean). Worth testing whether an all-Korean schema (e.g. `{"재작성된_질의": "..."}`) would change anything.
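One way to set up that test is an A/B pair of prompts that differ only in the JSON key language. The sketch below assumes a made-up Korean instruction string and a one-key schema; only the two key names (`rewritten`, `재작성된_질의`) come from this report.

```python
import json

# Hypothetical Korean instruction; substitute the real query_rewrite prompt.
INSTRUCTION = "다음 질문을 검색에 적합하게 다시 써줘. JSON 으로만 답해."


def schema_example(korean_keys: bool) -> str:
    """Return the one-key JSON example shown to the model."""
    key = "재작성된_질의" if korean_keys else "rewritten"
    return json.dumps({key: "..."}, ensure_ascii=False)


def build_prompt(query: str, korean_keys: bool) -> str:
    """Assemble instruction, schema example, and query into one prompt."""
    return f"{INSTRUCTION}\n형식: {schema_example(korean_keys)}\n질문: {query}"


query = "BlackRock 과 Vanguard 의 ETF 전략 차이를 비교해줘"
variant_en = build_prompt(query, korean_keys=False)
variant_ko = build_prompt(query, korean_keys=True)
print(variant_en)
print(variant_ko)
```

Because the two variants are otherwise byte-identical, a difference in empty-response rate between them would point at the key language rather than at prompt length or structure.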

### D. JAMES-side prompt-truncation artifact

The `verify.fact_check` log shows `prompt 4 319자 → 4 000자 축약` (prompt truncated from 4 319 to 4 000 characters): JAMES capped the prompt at 4 000 chars, which likely chopped the closing brace of an embedded JSON example in the system prompt. If true, this is a JAMES bug, not a Gemma 4 bug, but it would only explain `verify.fact_check`, not the other four empty responses.
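The truncation hypothesis is easy to illustrate with a synthetic prompt. The sketch below assumes a naive character cap and a JSON example that happens to straddle the 4 000-character boundary. Both are assumptions for illustration, not the actual JAMES prompt or truncation code.

```python
def truncate_chars(prompt: str, limit: int = 4000) -> str:
    """Naive character cap, standing in for whatever JAMES actually does."""
    return prompt[:limit]


def braces_balanced(text: str) -> bool:
    """Rough check that every '{' has a matching '}' (ignores string literals)."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# Synthetic 4 319-char prompt: filler, a JSON example crossing the cap, more filler.
example = '{"grounded": true, "unsupported": []}'
prompt = "x" * 3990 + example + "y" * (4319 - 3990 - len(example))

clipped = truncate_chars(prompt)
print(len(prompt), braces_balanced(prompt))
print(len(clipped), braces_balanced(clipped))
```

A check like `braces_balanced` run on the final prompt would flag this class of bug before the model is called, whichever hypothesis turns out to be right.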

The report explicitly **does not** advocate for a single hypothesis. Narrowing them down is the work this feedback round is asking the community to help with.

## What I'd love feedback on

If you've used `gemma4:e4b` (or `gemma4:e2b`) and have data points either way, I'd like to know:

1. Have you seen the same "empty response on short structured prompts" pattern? Especially critique-of-a-draft, JSON schema audit, query rewrite.
2. Did a prompt-engineering change rescue it on your setup? Different chat template, different `num_predict`, different temperature, all-one-language prompts, anything else.
3. Does `gemma4:e2b` show the same pattern, or is it specific to E4B?
4. Does the same prompt set behave on `gemma4:31b-dense` / `gemma4:26b-moe` if you have one of those provisioned?
5. Is there a known issue with Ollama + Gemma 4 + JSON-output prompts in your experience?
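For anyone answering question 2, a single-variable sweep is the most useful shape of data. The sketch below only builds request bodies for Ollama's `/api/generate` endpoint (using its `options.num_predict` and `options.temperature` fields) and leaves sending them to you. The cap and temperature values are illustrative, and the prompt is a placeholder.

```python
import json
from itertools import product


def build_payloads(model, prompt, caps, temperatures):
    """One /api/generate request body per (num_predict, temperature) pair."""
    return [
        {
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_predict": cap, "temperature": temp},
        }
        for cap, temp in product(caps, temperatures)
    ]


payloads = build_payloads(
    model="gemma4:e4b",
    prompt="<your query_rewrite prompt here>",  # placeholder
    caps=[200, 400, 1024, 4096],                # illustrative values
    temperatures=[0.0, 0.7],                    # illustrative values
)
print(len(payloads))
print(json.dumps(payloads[0], ensure_ascii=False))
```

Holding the prompt fixed while varying one option at a time makes it possible to attribute a change in the empty-response rate to that option alone.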

Project's stance on next steps:

- Default model swap to `gemma3:12b` is already done locally. `gemma4:e4b` stays available — its long-context synthesis is the project's bread-and-butter stage.
- A follow-up PR (option A2) will let operators wire individual cognitive stages to different backends, so E4B can keep `synth.rag` while a heavier model takes the meta stages.
- We will **not** patch JAMES's prompt shapes specifically to coax E4B into responding on these stages until we understand whether the empty response is the model declining, the chat template misfiring, or a JAMES-side truncation bug.
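As a rough idea of what per-stage wiring could look like, here is a sketch that resolves a model per stage from environment variables. The `JAMES_STAGE_MODEL_<STAGE>` naming scheme is hypothetical and invented for this example. Only `JAMES_LLM_MODEL` and the `gemma4:e4b` default come from the project.

```python
import os

DEFAULT_MODEL = "gemma4:e4b"


def model_for_stage(stage: str, env=None) -> str:
    """Resolve the model for a stage: stage override, then global, then default.

    Hypothetical scheme, e.g. JAMES_STAGE_MODEL_REFLECT_CRITIQUE=gemma3:12b
    """
    env = os.environ if env is None else env
    key = "JAMES_STAGE_MODEL_" + stage.upper().replace(".", "_")
    return env.get(key) or env.get("JAMES_LLM_MODEL") or DEFAULT_MODEL


env = {"JAMES_STAGE_MODEL_REFLECT_CRITIQUE": "gemma3:12b"}
print(model_for_stage("reflect.critique", env))
print(model_for_stage("synth.rag", env))
```

With a mapping like this, E4B keeps `synth.rag` by default while only the meta stages are pointed at a heavier model.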

## Reproduction

If you want to reproduce — or, more usefully, to falsify — the report on your own corpus:

```powershell
# 1. Install JAMES (one-liner, MIT, no cloud)
git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
python -m pip install -r requirements.txt

# 2. Make sure the two models are local
ollama pull gemma4:e4b
ollama pull gemma3:12b

# 3. Enable the five cognitive stages
$env:JAMES_ENABLE_QUERY_REWRITE = "1"
$env:JAMES_ENABLE_PLANNER       = "1"
$env:JAMES_ENABLE_REFLECT       = "1"
$env:JAMES_ENABLE_VERIFY        = "1"
$env:JAMES_ENABLE_FACT_CHECK    = "1"

# 4. Test with Gemma 4
$env:JAMES_LLM_MODEL = "gemma4:e4b"
python server_llmwiki.py
# In another shell, send a retrieval query, e.g. the same BlackRock vs Vanguard line above.
python scripts/replay_trace.py --recent

# 5. Control: Gemma 3
$env:JAMES_LLM_MODEL = "gemma3:12b"
python server_llmwiki.py
# Same query, same trace command: all 9 stages succeed
```

If you publish your own numbers — X / GitHub issue / Reddit / dev.to comment — please tag `#JAMES` or open an issue on the [repo](https://github.com/Hashevolution/James-RAG-Evol). I'll link it back to this report.

## A note on the companion piece

This Write-track submission and the [Build-track submission](https://dev.to/hashevolution/building-a-mini-palantir-on-gemma4e4b-128k-context-lets-the-graph-actually-be-graph-rag-33fk) are deliberately contradictory in tone: one defends the model choice, the other documents where the same model falls short on a different class of prompts. Both are honest readings of the same model under different conditions. I think the contradiction is the point: writing about Gemma 4 that is useful to the community has to include both halves, not just the half that fits the marketing arc.

If you've read [Ali Afana's parallel piece on MoE vs Dense](https://dev.to/alimafana/i-added-three-rules-to-gemma-4-the-moe-searched-the-dense-model-refused-1j18), you'll recognize the framing: same prompt, opposite behavior, architecture under the model is the variable I wasn't controlling. He came at it from MoE vs Dense; I came at it from 4 B vs 12 B and meta-task vs synthesis-task. The two reports compose.

---

🤖 *Honest disclosure: this submission was drafted with AI assistance and edited by the author. The trace numbers, environment specs, and reproduction commands are real and verifiable in the linked repository. The hypotheses are the author's; the fair-witness framing — data without root cause — is deliberate.*
```

`

---

## Where to publish

dev.to → New Post → Editor v1 (markdown) → paste the body above → set title, tags, cover image → **Publish**.

After publish:

1. Add the URL to `reports/promo-assets/launch-tracker.md` "Social posts" table (or trigger a small docs PR — happy to handle this from a future session).
2. Add a self-reply comment on the article pointing at:
   - The internal eval report ([`gemma4-e4b-cognitive-stages-eval.md`](./gemma4-e4b-cognitive-stages-eval.md)) — the source of truth.
   - The Build-track submission — completes the "two halves, same author" arc.
   - Ali Afana's parallel piece — extends the conversation across two writers.
3. Quote-reply from the existing X English thread + LinkedIn post linking the new article. Image: `06-3d-graph.jpg` again (the hero), or `03-chat-graph-paths.jpg` if the post wants to lead with the chat UI.

## Why this submission can win the Write track

| Rubric criterion | This piece |
|---|---|
| **Clarity and depth of explanation** | One controlled experiment, six tabulated trace rows, four named hypotheses, explicit reproduction script |
| **Originality of perspective or insight** | Fair-witness framing — "I have data, not a conclusion" — is rare in dev.to LLM writing. Most pieces commit to a hypothesis early |
| **Practical value to the community** | The five open questions are answerable by anyone running Gemma 4 + Ollama. Any single reply with falsifying data is useful project-wide |
| **Quality of writing** | Inherited from the eval report's voice — short paragraphs, tight tables, no flourish |

Combined with the Build-track piece, the same author appears twice on the challenge with two non-overlapping perspectives on the same model. That itself is a signal of seriousness — defending a model in one piece and documenting its limits in the other is the opposite of a marketing arc.

## Risk-management notes

- The piece is honest about a failure mode of Gemma 4. It is *not* a hit piece — it explicitly preserves credit for what the model does well, and frames the failure as "rich call for community data" rather than "model is bad." This tone is the actual differentiator.
- The mention of Korean text in failed prompts could be misread as a language-equity issue. The body explicitly frames Hypothesis C as one of four possibilities and proposes the test (Korean-key JSON). That is the right scope for the claim, and no larger.
- Title A leads with five numbers. If dev.to's automatic linting flags it, B or C are safe fallbacks.

## Companion artifacts

- Source eval report (definitive numbers): [`gemma4-e4b-cognitive-stages-eval.md`](./gemma4-e4b-cognitive-stages-eval.md)
- Feedback-routing handover (what to do when replies arrive): [`docs/handovers/v0.3.x-gemma4-feedback-track.md`](../../docs/handovers/v0.3.x-gemma4-feedback-track.md)
- Build-track companion: [`devto-gemma4-challenge.md`](./devto-gemma4-challenge.md)
- Visual library for cover / inline images: [`screenshots/README.md`](./screenshots/README.md)
- Launch tracker (running log): [`launch-tracker.md`](./launch-tracker.md)

Building a Mini Palantir on gemma4:e4b — 128K Context Lets the Graph Actually Be Graph-RAG

Hashevolution — Wed, 13 May 2026 07:59:18 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

PROJECT JAMES — a security-focused, locally-runnable Graph-RAG knowledge engine in Python, MIT-licensed. Think "Mini Palantir Foundry, but MIT, runs on a laptop, no cloud":

Graph-RAG with 12-type ontology — relations carry semantic meaning, not just vector similarity
3-stage access control — RBAC + ABAC + instruction isolation (vector → graph → output)
Self-evolution scaffold — feedback → patch → 4-Gate validation → auto-rollback on bench regression, with approver_username audit
100% local via Ollama — no cloud LLM dependency
Explicit reasoning paths surfaced in every response

The problem it solves: most local RAG projects pick one of "ontology-aware retrieval", "role-based security", or "self-evolution audit logs". JAMES combines all three because for a 1-person knowledge engine, security and reasoning have to be the same pipeline, not two pipelines glued together — the graph traversal is the security boundary. A confidential entity is never visited for an employee role, so the model never sees it. No jailbreak prompt can leak content it never had in the context.

Palantir® is a registered trademark of Palantir Technologies Inc. PROJECT JAMES is not affiliated. "Mini Palantir" is a descriptive comparison of the ontology-and-audit-log design pattern.

Demo

Self-hosted, alpha v0.2.0. Quick-start (≈ 5 minutes on a laptop):

git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
cp .env.example .env   # set JAMES_API_KEY, JAMES_JWT_SECRET (32+ char)
pip install -r requirements.txt
ollama pull gemma4:e4b   # ~2.5 GB
python server_llmwiki.py

→ http://localhost:8000 — chat UI + admin dashboard + 3D ontology graph visualizer.

Key endpoints worth poking at:

POST /query/ — natural-language query, returns answer + traversed graph_paths strings
GET /admin/graph — 3D force-directed ontology visualizer (Three.js)
GET /admin/patch/audit — operator-facing audit log over the patch lifecycle
GET /admin/trace/{trace_id} — full pipeline replay for any query (auth → retrieve → graph → tool → answer → complete stages, with per-stage latency)

Background article with architecture diagram and limitations: dev.to write-up.

Code

Repository: Hashevolution/James-RAG-Evol — MIT

Code-quality and security signals for reviewers:

OpenSSF Best Practices passing badge (Tiered 111%, awarded 2026-05-11)
7 published GitHub Releases through v0.2.0 (Foundation Hardening, 5/6 axes engineering-complete)
ruff.toml enforces F821 / F541 / F401 / F841 on every PR via GitHub Actions
83-item security regression suite (james_security_test.py): injection, path traversal, prompt injection, unsafe deserialization
17-item password regression suite (tests/test_password_bcrypt.py)
bcrypt password storage with transparent SHA-256 → bcrypt migration on first login (PR #173)
GitHub Private Vulnerability Reporting enabled
Module-size gate: no file under core/ exceeds 20 KB

Where Gemma 4 lives in the codebase:

config.py:139 — GEMMA_MODEL = os.environ.get("JAMES_LLM_MODEL", "gemma4:e4b")
llm/router.py — task-aware dispatch (task_type=extract / classify / general / coding / vision); every production call site declares its task
core/reasoning/pipeline.py — RAG retrieval pipeline with explicit graph_paths argument carried to the model
core/security_layer.py::pre_check — risky-coding hard-refuse, byte-identical to prompt-injection block

How I Used Gemma 4

Model choice: E4B (gemma4:e4b)

Three Gemma 4 variants were available. The choice was forced by single-user, laptop-class constraints:

| Variant | Considered for | Outcome |
|---|---|---|
| 31B Dense | Server-grade reasoning depth | ❌ Doesn't fit 16 GB RAM; single-user means no throughput need |
| 26B MoE | Long-context advanced reasoning | ❌ Expert-routing overhead helps batch workloads; single-user has batch size 1 |
| E4B (4B effective) ⭐ | Edge variant: 4B params, native multimodal, 128K context | ✅ Fits 8 GB GPU or CPU-only laptop, gives the 128K window I need, supports vision for v0.3 multimodal track |
| E2B (2B effective) | Smaller still | ⚠️ Tested as fallback; reasoning depth too low for graph synthesis at depth 3+ |

The deciding factor was the 128K context window — not parameter count. Here's why.

Why 128K context matters for Graph-RAG specifically

A typical RAG pipeline retrieves top-k chunks and stuffs them into the prompt. Graph-RAG retrieves chunks and the relations between them — and the relations carry semantic meaning I want the model to reason over, not see as decoration.

A depth-3 query against my 161-entity wiki produces a context like:

[retrieved chunk 1]  (entity A, sensitivity=public)
[retrieved chunk 2]  (entity B, sensitivity=public)
...
[graph_path]  A --[CAUSES]--> X --[REQUIRES]--> Y --[BLOCKED_BY]--> B
[graph_path]  A --[KNOWN_AS]--> A' --[REFERENCES]--> C
[ontology]    relation 'CAUSES'      directed=true  weight=0.85
[ontology]    relation 'BLOCKED_BY'  directed=true  weight=0.92
[instruction] use graph_paths to constrain the answer

For real queries at depth 3 this routinely hits ~40K tokens. With a 32K-window model (most older OSS LLMs), I'd be silently truncating the graph paths — meaning the model defaults to vector-only reasoning and the ontology becomes decoration. With Gemma 4's 128K window the full retrieval result fits in one shot and the model actually reasons over the relation labels.

This is the property I designed the rest of the system around. Without 128K, the "Graph-RAG with ontology" claim collapses into "RAG with extra metadata".
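The window arithmetic behind that claim can be sketched in a few lines. The output reserve of 2 000 tokens and the binary reading of "K" are assumptions made for this example. The 40K, 32K and 128K figures are the ones quoted above.

```python
K = 1024  # tokens per "K", assumed binary here


def fits_window(prompt_tokens: int, window_tokens: int,
                reserve_for_output: int = 2_000) -> bool:
    """True when the prompt plus an output reserve fits in the context window."""
    return prompt_tokens + reserve_for_output <= window_tokens


depth3_prompt = 40 * K   # the ~40K figure quoted above
print(fits_window(depth3_prompt, 32 * K))    # False: graph paths would be cut
print(fits_window(depth3_prompt, 128 * K))   # True
```

A check like this, run before the call, turns silent truncation into an explicit decision about which part of the context to drop.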

Native function calling → router `task_type`

Gemma 4's native function calling underpins llm/router.py::call_router, which makes every call declare its purpose:

call_router(prompt, task_type="extract",  **kwargs)   # entity extraction
call_router(prompt, task_type="classify", **kwargs)   # intent classification
call_router(prompt, task_type="general",  **kwargs)   # chat answer
call_router(prompt, task_type="coding",   **kwargs)   # code generation

The same router can route task_type=coding to a 32B Coder and task_type=general to Gemma 4, but the default for general reasoning is gemma4:e4b because the 128K window dominates everything else at this scale.

Reasoning mode → security policy

Gemma 4's chain-of-thought reasoning is what makes the risky-coding hard-refuse policy actually usable. The block fires before the LLM is called for clear destructive patterns (regex match in core/security_layer.py::RISKY_CODING_REGEX), but borderline queries pass pre_check and the model itself classifies them:

Query asks how to perform destructive command on a target → refuse, byte-identical block message
Query asks about command syntax (documentation) → answer normally

Without a reasoning-capable model, this distinction collapses into "block everything" (false positives) or "answer everything" (security holes). Gemma 4's reasoning is what threads the needle.

What I didn't use yet

Native multimodal retrieval — Image OCR (Tesseract, EasyOCR) and video ASR (Whisper) are wired as ingestion paths, but treating images/audio as first-class graph citizens during retrieval is the v0.3 deliverable. Gemma 4's native vision is ready and waiting.
31B Dense or MoE for server deployment — JAMES stays single-machine until v1.0 by design (docs/PLATFORM_READINESS.md). When multi-tenancy lands, swapping JAMES_LLM_MODEL=gemma4:31b is a one-env-var change — the router already abstracts it.

One-line summary of the model fit

128K context is what lets Graph-RAG be graph-RAG instead of "RAG with extra metadata". gemma4:e4b is the smallest variant that ships it at a footprint a laptop can hold.

Looking for: adversarial review of the security model, a second user willing to run scripts/bench.py --suite=step7 on their own corpus (that's the v0.2 → v0.3 gate), and critiques of the self-evolution 4-Gate.

GitHub: https://github.com/Hashevolution/James-RAG-Evol
OpenSSF: https://www.bestpractices.dev/projects/12806

🤖 Honest disclosure: this submission was drafted with AI assistance and edited by the author. The codebase, design decisions, model-choice rationale, and limitations described above are real and verifiable in the linked repository.

Building a Mini Palantir: A Local Graph-RAG Engine with Ontology, Security, and Self-Evolution (Alpha)

Hashevolution — Tue, 12 May 2026 08:27:36 +0000

TL;DR
PROJECT JAMES is a security-focused, locally-runnable Graph-RAG knowledge engine in Python. It combines an explicit 12-type ontology, 3-stage access control (RBAC + ABAC + instruction isolation), a self-evolution scaffold with audit log, and 100% local execution via Ollama. MIT-licensed, alpha v0.2.0, OpenSSF Best Practices passing.

Why I built this

If you've ever wanted to point a local LLM at your own wiki, codebase, or document store, you've probably hit the same three walls I did:

Cloud RAG services want everything in their cloud — fine for prototypes, painful for anything sensitive.
Self-hosted RAG frameworks are usually one of: (a) too much infrastructure (Kubernetes-shaped), or (b) too few security primitives (no role separation, no audit trail).
Most Graph-RAG implementations treat the graph as a side feature on top of vectors. The graph rarely participates in the security boundary or the reasoning path.

I wanted something closer to Palantir Foundry's mental model — an explicit ontology, capability-token security, a full audit log — but compressed into something one person can run on a laptop, under MIT, without a cloud account.

That's what PROJECT JAMES is.

Palantir® is a registered trademark of Palantir Technologies Inc. PROJECT JAMES is not affiliated with or endorsed by Palantir. "Mini Palantir" here is a descriptive comparison of the ontology-and-audit-log design pattern, not a product claim.

What's in the box

Five things that rarely show up in the same Python repo:

| # | Capability | What it does |
|---|---|---|
| 1 | Graph-RAG with ontology | 12 relation types; relations carry semantic meaning beyond vector similarity |
| 2 | 3-stage security | RBAC + ABAC + Instruction Isolation, applied at vector → graph → output |
| 3 | Self-evolution scaffold | feedback signals → patch proposals → 4-Gate validation → auto-rollback on bench regression, all with `approver_username` audit |
| 4 | 100% local | Ollama-based, no cloud LLM dependency. Gemma `gemma2:2b` works on a laptop |
| 5 | Explicit reasoning paths | Every response surfaces the traversed graph paths so you can see why it answered that way |

Architecture at a glance

[User query]
     ↓
[Security filter]      ← 31+ injection patterns + risky-coding hard-refuse
     ↓
[Query router]         ← chat / coding / retrieval / web_search
     ↓
[Hybrid search]        ← Vector(60%) + BM25(20%) + keyword(10%) + name(10%)
     ↓
[Graph engine]         ← DFS traversal + confidence pruning + sensitivity gating
     ↓
[Reasoning loop]       ← retrieve → expand → verify
     ↓
[Output filter]        ← PII masking + role-based content filter
     ↓
[Answer + reasoning path]

The graph is not a side index. Every retrieval that reaches the graph engine is gated by the user's role, the entity's sensitivity, and the ontology relation type. Removing the graph would break the security model — they're the same pipeline, not two pipelines glued together.
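The hybrid-search line in the diagram amounts to a weighted score fusion. The sketch below uses the four weights from the diagram and assumes each signal is already normalised to [0, 1]. The function names and the sample candidates are invented for illustration.

```python
# Weights taken from the architecture diagram above.
WEIGHTS = {"vector": 0.60, "bm25": 0.20, "keyword": 0.10, "name": 0.10}


def hybrid_score(signals: dict) -> float:
    """Weighted sum of per-signal scores; missing signals count as 0."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)


def rank(candidates: dict, top_k: int = 10) -> list:
    """Return candidate ids ordered by descending hybrid score."""
    ordered = sorted(candidates, key=lambda doc: hybrid_score(candidates[doc]),
                     reverse=True)
    return ordered[:top_k]


candidates = {
    "doc_a": {"vector": 0.9, "bm25": 0.1},
    "doc_b": {"keyword": 1.0, "name": 1.0},
    "doc_c": {"vector": 0.5, "bm25": 1.0},
}
print(rank(candidates))
```

With these weights a strong vector match outranks a document that only matches on keyword and name, which is the bias the 60/20/10/10 split encodes.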

A typical query lifecycle

# Pseudocode for what happens behind /query/
def answer(query: str, user: User) -> Response:
    # 1. Pre-check: 31+ injection patterns, risky-coding hard-refuse
    if security_layer.pre_check(query) == BLOCK:
        return RESPONSE_BLOCKED  # byte-identical block message

    # 2. Hybrid retrieval — vector + BM25 + keyword + name match
    candidates = hybrid_search(query, top_k=10)

    # 3. Graph expansion — only visit entities the user can read
    paths = graph_engine.expand(
        seed_entities=candidates,
        role=user.role,                # RBAC
        sensitivity_ceiling=user.tier, # ABAC
        max_depth=3,
    )

    # 4. Reason over retrieved context (LLM call via router)
    answer, reasoning_trace = llm.reason(query, paths)

    # 5. Output filter — PII mask, role-based redact
    return output_filter.apply(answer, user.role)

The interesting part is step 3: the graph traversal itself is access-controlled, not just the final output. A confidential entity is never even traversed for an employee user, so the model never sees it. This means no jailbreak prompt can talk the LLM into leaking content it never had in the context.
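A toy version of that gated traversal is sketched below. It is a depth-first walk that refuses to step onto any entity above the caller's sensitivity ceiling. The three-tier sensitivity scale and the `Entity` shape are assumptions for illustration, not the JAMES data model.

```python
from dataclasses import dataclass, field

SENSITIVITY = {"public": 0, "internal": 1, "confidential": 2}  # hypothetical tiers


@dataclass
class Entity:
    name: str
    sensitivity: str = "public"
    edges: list = field(default_factory=list)  # (relation, target_name) pairs


def expand(graph, seeds, ceiling, max_depth=3):
    """DFS that never visits an entity above the caller's ceiling."""
    limit = SENSITIVITY[ceiling]
    paths, visited = [], set()

    def visit(name, trail, depth):
        entity = graph[name]
        if SENSITIVITY[entity.sensitivity] > limit:
            return  # gated: never visited, never added to a path
        visited.add(name)
        trail = trail + (name,)
        if depth > 0:
            paths.append(trail)
        if depth == max_depth:
            return
        for relation, target in entity.edges:
            if target in trail:  # avoid cycles
                continue
            visit(target, trail + (relation,), depth + 1)

    for seed in seeds:
        visit(seed, (), 0)
    return paths, visited


graph = {
    "A": Entity("A", "public", [("CAUSES", "X"), ("REFERENCES", "C")]),
    "X": Entity("X", "public", [("REQUIRES", "Secret")]),
    "Secret": Entity("Secret", "confidential"),
    "C": Entity("C"),
}
paths, visited = expand(graph, ["A"], ceiling="public")
print(paths)
```

Because the check happens before the entity is added to any path, nothing downstream (prompt assembly, the model, the output filter) ever receives the gated content.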

Security in depth

A few specific behaviors worth calling out:

Hard-refuse for destructive commands

Queries that ask the model to produce filesystem-wide deletion, SQL DROP DATABASE, git reset --hard, etc. trigger a byte-identical block message before the LLM is ever called. The block message is the same string as the prompt-injection block, so an audit consumer cannot distinguish the two externally.

Patterns live in core/security_layer.py::RISKY_CODING_REGEX. Korean scope markers (전체, 모든) are recognized too.
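The byte-identical behaviour can be sketched as a pre-check that returns one shared message for both causes. The patterns and the message text below are illustrative stand-ins. The real list is the one in `RISKY_CODING_REGEX`, and this simplified `pre_check` returns the message directly rather than a `BLOCK` sentinel.

```python
import re

BLOCK_MESSAGE = "Request blocked by policy."  # placeholder text

# Illustrative patterns only; not the project's actual regex list.
RISKY_CODING = re.compile(
    r"rm\s+-rf\s+/|drop\s+database|git\s+reset\s+--hard",
    re.IGNORECASE,
)
PROMPT_INJECTION = re.compile(
    r"ignore\s+(all\s+)?previous\s+instructions",
    re.IGNORECASE,
)


def pre_check(query: str):
    """Return the block message, or None when the query may reach the LLM."""
    if RISKY_CODING.search(query) or PROMPT_INJECTION.search(query):
        return BLOCK_MESSAGE  # same bytes regardless of which rule fired
    return None


print(pre_check("please run rm -rf / now"))
print(pre_check("How does git rebase work?"))
```

Returning one constant for both rule families is what keeps an external observer from learning which detector fired.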

Bcrypt password storage with transparent migration

Passwords are stored as bcrypt$<hash>. Pre-bcrypt SHA-256 hex digests from older deployments are accepted on input only and rewritten to bcrypt on the next successful login — no manual migration needed.

Audit log everywhere

Every approved self-evolution patch is recorded with approver_username, approver_role, approved_at, and approval_method in the patch lifecycle JSONL. There is no auto-deploy path that bypasses this — if you bypass it, your fork stops being JAMES.
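A record of that shape could be serialised as below. This is a sketch: `patch_id` is a hypothetical field added so the example has something to key on, while the other four field names are the ones listed above.

```python
import json
from datetime import datetime, timezone


def audit_line(patch_id: str, approver_username: str, approver_role: str,
               approval_method: str) -> str:
    """Serialise one approval as a single JSONL line.

    patch_id is a hypothetical field; the other four names come from the article.
    """
    if not approver_username:
        raise ValueError("approver_username is required")
    record = {
        "patch_id": patch_id,
        "approver_username": approver_username,
        "approver_role": approver_role,
        "approved_at": datetime.now(timezone.utc).isoformat(),
        "approval_method": approval_method,
    }
    return json.dumps(record, ensure_ascii=False)


line = audit_line("patch-001", "alice", "admin", "manual")
print(line)
```

Refusing to build a record without an approver makes the "no bypass" rule a property of the writer rather than of its callers.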

Self-evolution scaffold

This is the part that scares people most when I describe it, so let me be precise about what it does and doesn't do:

What it does:

Collects feedback signals from /query/ responses (thumbs-up/down, latency, hallucination flags)
Generates a candidate patch proposal (LLM-assisted)
Validates it through a 4-Gate pipeline:
- Gate 1: Syntactic — parses, imports, no obvious explosions
- Gate 2: Test suite — existing tests still pass
- Gate 3: Bench eval — STEP 7 regression suite stays within tolerance
- Gate 4: Human approval — approver_username required
Applies the patch with a known-good backup
Auto-rollback if Gate 3 detects a post-deploy regression
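The gate ordering can be expressed as a short fail-fast pipeline. The sketch below assumes each automated gate is a callable returning a boolean and that Gate 4 reduces to "an approver name is present". The `Patch` class and gate labels are illustrative, and apply, backup and rollback are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Patch:
    diff: str
    approver_username: Optional[str] = None


Gate = Callable[[Patch], bool]


def run_gates(patch: Patch, syntactic: Gate, tests: Gate, bench: Gate) -> tuple:
    """Run Gates 1-3 in order, then require a named approver for Gate 4.

    Returns (passed, name_of_failed_gate_or_None).
    """
    gates = [("syntactic", syntactic), ("test_suite", tests), ("bench_eval", bench)]
    for name, check in gates:
        if not check(patch):
            return False, name  # stop at the first failing gate
    if not patch.approver_username:
        return False, "human_approval"
    return True, None


def always_pass(patch: Patch) -> bool:
    return True


print(run_gates(Patch("diff", "alice"), always_pass, always_pass, always_pass))
print(run_gates(Patch("diff"), always_pass, always_pass, always_pass))
```

Putting the human gate last means a reviewer is only asked to approve patches that have already cleared every automated check.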

What it does NOT do:

It does not auto-deploy without approver_username. If you set JAMES_AUTO_APPROVE=1, the server refuses to start unless JAMES_DEV_MODE=1 is also set.
It does not modify trust boundaries (auth, policy, sandbox) without an explicit architecture PR label.
It does not touch security-critical files inside core/security_layer.py or core/policy_engine.py automatically.
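The first rule in that list is a startup guard, sketched below. The two variable names are the ones quoted above. The function name, the error type and the message are assumptions for illustration.

```python
def check_startup_flags(env) -> None:
    """Refuse to start when auto-approve is on outside dev mode."""
    auto_approve = env.get("JAMES_AUTO_APPROVE") == "1"
    dev_mode = env.get("JAMES_DEV_MODE") == "1"
    if auto_approve and not dev_mode:
        raise RuntimeError("JAMES_AUTO_APPROVE=1 requires JAMES_DEV_MODE=1")


# Passes silently: auto-approve is allowed because dev mode is also set.
check_startup_flags({"JAMES_AUTO_APPROVE": "1", "JAMES_DEV_MODE": "1"})
```

In a real server this would be called with `os.environ` before any route is registered, so a misconfigured deployment fails at boot rather than at the first patch.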

The default deployment ships with JAMES_ENABLE_EVOLUTION=0. You have to opt in.

What it's NOT — honest limitations

PROJECT JAMES is alpha. Here's what doesn't work yet:

Real-data validation is the v0.2 → v0.3 gate. The internal STEP 7 suite passes (13 queries, security-block invariants, graph-paths bands), but the next gate is a second user running the bench end-to-end on their own corpus. That's a recruitment problem, not a coding problem, and I'm honest about it.
Multimodal retrieval is v0.3. Video-ASR (Whisper) and image OCR (Tesseract, EasyOCR) are wired and work as ingestion paths, but multimodal retrieval as a first-class graph citizen is the next milestone.
Self-evolution is verified single-user. It works on my machine. It has not been adversarially probed by a second user yet. Don't enable it in production.
Plugin API is v0.3. Domain packs (legal, food, retail, travel) are deliberately blocked until v1.0 — see docs/PLATFORM_READINESS.md for the gate definitions.

Trust signals

External validation that matters more than my self-assessment:

OpenSSF Best Practices passing badge (Tiered 111%, awarded 2026-05-11)
7 published GitHub Releases through v0.2.0 (Foundation Hardening)
Static analysis — ruff F-class rules (F821 + F541 + F401 + F841) enforced on every PR via GitHub Actions
Security tests — 83-item adversarial regression suite (james_security_test.py) covering injection, path traversal, prompt injection, unsafe deserialization; 17-item password regression suite (tests/test_password_bcrypt.py)
Vulnerability disclosure — GitHub Private Vulnerability Reporting enabled; backup channel documented in SECURITY.md
MIT-licensed, with CONTRIBUTING.md test-policy gate

Try it

git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol

# Configure
cp .env.example .env
# Edit .env — set JAMES_API_KEY, JAMES_JWT_SECRET (32-char random)

# Install (Python 3.11+)
pip install -r requirements.txt

# Pull a model
ollama pull gemma2:2b   # 1.6 GB, runs on a laptop

# Start
python server_llmwiki.py

Then http://localhost:8000.

Where this is going

Short-term roadmap:

v0.2.1: Recruitment for the second-user real-data validation gate
v0.3.0: Plugin API skeleton — core/plugins/base.py with 4 plugin interfaces, JAMES_PLUGINS loader, packs/general/ dogfood, multi-instance JAMES_WORKSPACE
v1.0: Production hardening + first domain packs (legal, retail, etc. only after this gate)

The bigger frame is in docs/PLATFORM_READINESS.md: PROJECT JAMES is a mother platform until v1.0. Domain forks happen after, not before. That's the discipline of the project.

Feedback welcome

I'm specifically looking for:

Adversarial review of the security model — the boundary, the audit log, the hard-refuse policy. If you can break the role separation, please open a private advisory.
A second-user corpus. If you've got a wiki/document store you can point this at and run scripts/bench.py --suite=step7 --check on, I want to know what breaks.
Critiques of the self-evolution scaffold — particularly whether the 4-Gate is enough gating, or whether it needs another stage before Gate 4.

Repo: https://github.com/Hashevolution/James-RAG-Evol
Discussions: GitHub Issues
Security: GitHub Private Vulnerability Reporting (preferred), karu-7@hanmail.net (backup)

If you build something on top of it, I'd love to hear about it.

🤖 Honest disclosure: this article was drafted with AI assistance and edited by the author. The codebase, design decisions, and limitations described here are real and verifiable in the linked repository.

Updates (2026-05-12): Submitted to the Gemma 4 Challenge with a follow-up article on the model-choice rationale:
Building a Mini Palantir on gemma4:e4b — 128K Context Lets the Graph Actually Be Graph-RAG.