<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Hashevolution</title>
    <description>The latest articles on Forem by Hashevolution (@hashevolution).</description>
    <link>https://forem.com/hashevolution</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3926644%2Ff977d90a-7772-4986-81c2-143af88dd6ac.png</url>
      <title>Forem: Hashevolution</title>
      <link>https://forem.com/hashevolution</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hashevolution"/>
    <language>en</language>
    <item>
      <title>Gemma 4 가 갑자기 답을 못 했다 — 외부 협업이 24시간 만에 root cause 찾아낸 이야기</title>
      <dc:creator>Hashevolution</dc:creator>
      <pubDate>Fri, 22 May 2026 08:51:00 +0000</pubDate>
      <link>https://forem.com/hashevolution/gemma-4-ga-gabjagi-dabeul-mos-haessda-oebu-hyeobeobi-24sigan-mane-root-cause-cajanaen-iyagi-5lg</link>
      <guid>https://forem.com/hashevolution/gemma-4-ga-gabjagi-dabeul-mos-haessda-oebu-hyeobeobi-24sigan-mane-root-cause-cajanaen-iyagi-5lg</guid>
      <description>&lt;p&gt;TL;DR (한 문단)&lt;br&gt;
자메스 (PROJECT JAMES, 로컬 Graph-RAG 엔진) 의 4개 인지 단계가 gemma4:e4b 에서 deterministic 하게 빈 응답 을 반환하는 패턴을 2026-05-18 에 fair-witness 보고서로 공개. 며칠 후 Ali Afana (Provia 창업자) 가 본인의 Gemma 4 walk-back article 에서 자메스를 3rd cross-validation context 로 인용. 자메스 측 단일 변수 실험으로 mechanism 확정: 모델이 visible output 첫 token 전에 ~500 token 의 hidden reasoning 을 소비. Cap 이 이 floor 이하면 100% 빈 응답. 4-line code change (PR #399) 머지 완료. 외부 → 내부 cross-validation, mechanism quantification, production fix 모두 24시간 안에.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;시작: "왜 빈 응답이지?"
자메스 (PROJECT JAMES) 는 로컬에서 돌아가는 Graph-RAG 시스템입니다. Ollama 위에 Gemma 4 의 efficient 변형 (gemma4:e4b, 4B 파라미터) 을 기본 모델로 사용합니다.
v0.3 의 cognitive middleware layer 가 phase 2 를 출하하면서 — query rewriter, planner, reflect, verify, fact-check 같은 단계가 추가되었는데 — 이상한 패턴이 발견됐습니다.
코드
같은 모델인데 어떤 stage 는 잘 작동하고, 어떤 stage 는 무조건 0자 응답. gemma3:12b (12B) 로 바꾸면 9/9 통과. 모델 변경만으로 해결되지만, 왜 그런지가 미궁.
2026-05-18 에 fair-witness 보고서를 dev.to 에 게시: 4 가지 가설을 제시하고 답을 단정하지 않음.
가설 A: 4B 모델의 메타-추론 capacity floor
가설 B: 짧은 구조화 prompt 에서 early stop-token
가설 C: 한국어 지시 + 영어 JSON 키 혼합 confusion
가설 D: JAMES 측 prompt truncation 버그
외부 데이터 환영. 외부 reader 가 어느 가설을 falsify 하든 confirm 하든 보고서에 누적.&lt;/li&gt;
&lt;li&gt;Ali Afana 의 walk-back — 다른 deployment context 에서 같은 패턴
며칠 후, Ali Afana 가 dev.to 에서 Gemma 4 의 다른 변형 (31B Dense vs 26B MoE) 에 대해 본인의 분석을 공개했습니다. 첫 주장: "두 architecture 의 동작 차이는 architecture 때문이다."
그런데 Robin Converse (Triava Labs) 가 본인의 sovereign Ollama 환경에서 단순 검증을 했습니다 — max_tokens cap 을 풀고 같은 시나리오를 돌렸더니 18/18 다 통과. 그녀가 Ali 에게 던진 질문: "managed Gemini 쪽에서 cap 을 풀면 어떻게 되나요?"
Ali 가 단일 변수 재실험: max_tokens 400 → 4096. Dense 12/12, MoE 12/12 — 모두 회복.
그 결과로 Ali 는 본인의 article 을 공개적으로 walk-back: "차이는 architecture 가 아니라 token cap 이었다."
walk-back article 에서 그는 자메스의 production default 를 3rd cross-validation context 로 명시:
Source
Context
Test
Result
Robin Converse
sovereign Ollama, uncapped
6 시나리오 × 3 온도
18/18
Ali Afana
managed Gemini, 400 → 4096
12 calls
12/12 회복
JAMES (자메스)
local Ollama, default 200/400/400/400
5/6 stages
빈 응답&lt;/li&gt;
&lt;li&gt;자메스 측 검증: V3' 단일 변수 실험
자메스 코드를 점검해 보니 충격적인 일치:
Stage
자메스 default
Ali 의 failing cap
query_rewriter.py:46
200
400
planner.py:43
400
400 ← 정확 일치
reflect.py:54 (CRITIQUE)
400
400 ← 정확 일치
verify.py:69 (FACT_CHECK)
400
400 ← 정확 일치
자메스의 production default 가 정확히 Ali 의 failing threshold 였습니다. 우리도 모르고 있었던 일치.
이제 단일 변수로 검증할 차례. V3' 라고 명명한 사내 실험:
V3'.a — query_rewriter stage (n=10 per cap)
코드
이게 의미하는 바: gemma4:e4b 가 visible output 첫 token 직전까지 ~500 token 의 hidden reasoning 을 소비합니다. Cap 이 이 floor 이하면 모델은 visible byte 하나도 emit 못 함. 100% deterministic empty.
Ali 가 본인 article 에서 "starving the reasoning layer" 라고 비유한 패턴을 토큰 수준에서 정량 측정 한 셈입니다.
V3'.b — planner stage (n=10 per cap)
코드
Cross-stage 진단:
Metric
V3'.a (cap 200)
V3'.b (cap 400)
해석
Default-cap latency
2.1s
4.3s
Cap 의 2배 → 시간도 정확히 2배
4096-cap latency
5.3s
7.1s
+1.8s for planner 의 추가 reasoning
200 cap 의 latency 가 2.1s, 400 cap 이 4.3s — 선형 비례. 즉 ~500 token 의 reasoning floor 가 stage 와 무관한 모델 수준 특성. Cap 만큼의 시간을 선형으로 소진하다가 visible output 1 byte 도 emit 못 한 채 종료.
가설 공간 정리
✅ B (token budget): 확정 — mechanism 까지 측정
❌ A (4B floor): 사실상 기각 — 같은 모델이 cap 만 풀면 정상 작동
🤷 C, D: 변동 없음 (검증 안 됨)
⏸ E (-tag 후처리): cross-stage 일관성으로 약화&lt;/li&gt;
&lt;li&gt;The Fix — 4 line code change
PR #399: 4 개 stage 의 DEFAULT_MAX_TOKENS 상수를 4096 으로 bump.
Diff
각 변경에는 stable-WHY 코멘트 추가 — 미래 maintainer 가 "왜 4096 인가" 를 코드만 보고 이해할 수 있도록.
STEP 7 bench (13개 query 회귀 테스트, gemma3:12b 권장 모델 기준): 13/13 baseline tolerance 내. 변경의 비파괴성 확인.
Squash-merge 머지 완료.&lt;/li&gt;
&lt;li&gt;향후 진행
즉시 (1-3일)
V3'.c (reflect.critique) + V3'.d (verify.fact_check) post-merge validation. 같은 protocol, 같은 모델, 같은 default 400 — 동일 패턴 재현 강한 prior. Unexpected drift 면 single-line revert
synth.web_summary 의 inline max_tokens=300 (core/reasoning/pipeline_synth.py:141) 도 ~500 floor 아래 — 별도 PR 로 fix
중기 (Mid-June)
Ali Afana 의 Gemini backend implementation PR 도착 예정 (Track 1 Provider contract 가 이미 설정해놓은 surface)
Track 3 STEP 7 cross-experiment: 자메스 (Ollama local) + Ali (Gemini API) 의 swap eval — 같은 wiki corpus 에서 두 backend 의 retrieval-conditioning + synthesis layer 비교
장기 (Joint piece, Track 5)
Robin Converse 의 temperature sweep post + 자메스 cross-experiment + Ali Gemini 결과를 3-name joint piece 로 출판
가제: "3 contexts, 2 architectures, 1 mechanism" — 세 운영 환경에서 같은 mechanism 을 관찰한 협업 사례&lt;/li&gt;
&lt;li&gt;이 이야기에서 배운 것
기술적으로
LLM 의 "빈 응답" 은 종종 모델 capability 가 아니라 budget 부족. Cap 이 hidden reasoning floor 이하면 visible output 전에 cap 도달.
gemma4:e4b 의 hidden-to-visible token ratio 는 약 5-6:1 (단일 stage 측정). 이건 모델 수준 특성으로 보임.
한 prompt 가 require 하는 reasoning budget 을 측정 없이 cap 잡으면 deterministic 실패의 함정. 운영 default 는 모델별로 floor 측정 후 결정해야 함.
협업으로
외부 사람이 본인의 hypothesis 를 honest 하게 walk-back 하는 것 이 협업의 가장 큰 가치. Ali 가 본인 article 을 공개적으로 정정한 결과로, 세 명의 deployment context 가 24시간 안에 cross-validation 완성.
공개 fair-witness 보고서 의 가치 — 외부 사람이 그것을 reference 로 인용하면서 연구의 chain 이 형성됨. 닫힌 연구 노트에서는 불가능한 형식.
Single-variable test 의 힘. Ali 가 본인의 가설을 검증한 방식, 우리가 그것을 자메스에서 검증한 방식, 모두 한 변수만 바꿔서 다른 모든 것 고정. Mechanism 격리에 필수.
링크
자메스 fair-witness 보고서 (2026-05-18): &lt;a href="https://dev.to/hashevolution/5-empty-responses-from-gemma4e4b-4-hypotheses-0-root-cause-1ggd"&gt;https://dev.to/hashevolution/5-empty-responses-from-gemma4e4b-4-hypotheses-0-root-cause-1ggd&lt;/a&gt;
Ali Afana walk-back article: &lt;a href="https://dev.to/alimafana/i-raised-gemma-4s-token-cap-the-dense-model-stopped-refusing-2gf3"&gt;https://dev.to/alimafana/i-raised-gemma-4s-token-cap-the-dense-model-stopped-refusing-2gf3&lt;/a&gt;
자메스 GitHub repo: &lt;a href="https://github.com/Hashevolution/James-RAG-Evol" rel="noopener noreferrer"&gt;https://github.com/Hashevolution/James-RAG-Evol&lt;/a&gt;
PR #399 (cap fix 머지됨): &lt;a href="https://github.com/Hashevolution/James-RAG-Evol/pull/399" rel="noopener noreferrer"&gt;https://github.com/Hashevolution/James-RAG-Evol/pull/399&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>gemmachallenge</category>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>5 empty responses from gemma4:e4b. 4 hypotheses. 0 root cause.</title>
      <dc:creator>Hashevolution</dc:creator>
      <pubDate>Mon, 18 May 2026 08:49:44 +0000</pubDate>
      <link>https://forem.com/hashevolution/5-empty-responses-from-gemma4e4b-4-hypotheses-0-root-cause-1ggd</link>
      <guid>https://forem.com/hashevolution/5-empty-responses-from-gemma4e4b-4-hypotheses-0-root-cause-1ggd</guid>
      <description>&lt;h1&gt;
  
  
  dev.to — Gemma 4 Challenge submission (Write track)
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Drafted: 2026-05-18&lt;br&gt;
Track: &lt;strong&gt;Write about Gemma 4&lt;/strong&gt; ($100 × 5 winners)&lt;br&gt;
Source material: &lt;a href="//./gemma4-e4b-cognitive-stages-eval.md"&gt;&lt;code&gt;gemma4-e4b-cognitive-stages-eval.md&lt;/code&gt;&lt;/a&gt; — internal fair-witness report (PR #307)&lt;br&gt;
Companion submission: &lt;a href="//./devto-gemma4-challenge.md"&gt;Build track piece on E4B model choice&lt;/a&gt;&lt;br&gt;
Submission deadline: 2026-05-24 23:59 PDT&lt;br&gt;
Winners announced: 2026-06-04&lt;br&gt;
Tags: &lt;code&gt;devchallenge&lt;/code&gt;, &lt;code&gt;gemmachallenge&lt;/code&gt;, &lt;code&gt;gemma&lt;/code&gt;, &lt;code&gt;ollama&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why a Write-track submission in addition to the Build-track one
&lt;/h2&gt;

&lt;p&gt;The Build-track submission (&lt;a href="https://dev.to/hashevolution/building-a-mini-palantir-on-gemma4e4b-128k-context-lets-the-graph-actually-be-graph-rag-33fk"&gt;&lt;code&gt;Building a Mini Palantir on gemma4:e4b — 128K Context Lets the Graph Actually Be Graph-RAG&lt;/code&gt;&lt;/a&gt;) made an &lt;em&gt;intentional model choice&lt;/em&gt; argument: 128K context window &amp;gt; parameter count, so E4B was right for the Graph-RAG retrieval-conditioning stage.&lt;/p&gt;

&lt;p&gt;That argument held — and it is still the strongest single thing E4B does in this project.&lt;/p&gt;

&lt;p&gt;But once the v0.3 Cognitive Middleware Layer started shipping Phase 2 stages (verification, planner, tool router, query rewriter, fact-check), a second pattern showed up that the Build-track piece could not honestly absorb: E4B silently returns empty responses on five of the nine cognitive stages, while the same prompts on Gemma 3 12B succeed end-to-end.&lt;/p&gt;

&lt;p&gt;This is the Write-track piece that documents that pattern honestly, without retreating from the Build-track claim. Same author, two articles, two facets of the same model.&lt;/p&gt;

&lt;p&gt;The challenge judging rubric for the Write track is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clarity and depth of explanation&lt;/li&gt;
&lt;li&gt;Originality of perspective or insight&lt;/li&gt;
&lt;li&gt;Practical value to the community&lt;/li&gt;
&lt;li&gt;Quality of writing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A fair-witness field report meets all four at once: it shares reproducible numbers, an explicit "I don't know yet" stance on root cause, and a set of open questions that other operators can act on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Suggested title (pick one)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Title&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;A&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;5 empty responses from &lt;code&gt;gemma4:e4b&lt;/code&gt;. 4 hypotheses. 0 root cause.&lt;/strong&gt; A fair-witness field report from a Graph-RAG production.&lt;/td&gt;
&lt;td&gt;⭐ Strongest hook — number-led, names a tension (no resolution), promises honesty.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Where Gemma 4 e4b runs out of room: empty responses on meta-reasoning stages&lt;/td&gt;
&lt;td&gt;Clearer technical framing, slightly less click-worthy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;Gemma 4 e4b: brilliant at synthesis, silent on meta-reasoning. A field report.&lt;/td&gt;
&lt;td&gt;Bridges the strengths and weaknesses in the title itself&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Recommended: &lt;strong&gt;A&lt;/strong&gt;. Numbers-led titles outperform on the dev.to feed; the "0 root cause" half signals the writing is honest rather than gloating.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cover image
&lt;/h2&gt;

&lt;p&gt;Use &lt;a href="//./screenshots/03-chat-graph-paths.jpg"&gt;&lt;code&gt;reports/promo-assets/screenshots/03-chat-graph-paths.jpg&lt;/code&gt;&lt;/a&gt; — the chat-UI screenshot with "그래프 경로 47개 보기" surfaced. It primes the reader for "this writer ships a real Graph-RAG pipeline" before the article gets into the failure mode.&lt;/p&gt;

&lt;p&gt;Download URL: &lt;code&gt;https://github.com/Hashevolution/James-RAG-Evol/blob/main/reports/promo-assets/screenshots/03-chat-graph-paths.jpg?raw=true&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tags
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;devchallenge, gemmachallenge, gemma, ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ollama&lt;/code&gt; is the 4th tag (instead of &lt;code&gt;rag&lt;/code&gt;) — the failure mode plausibly involves the Ollama chat template or stop-token handling, so the Ollama tag's audience is more likely to recognize the pattern.&lt;/p&gt;




&lt;h1&gt;
  
  
  Submission body (copy-paste into dev.to editor)
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="ge"&gt;*This is a submission for the [Gemma 4 Challenge: Write about Gemma 4](https://dev.to/challenges/google-gemma-2026-05-06)*&lt;/span&gt;

&lt;span class="gu"&gt;## TL;DR&lt;/span&gt;

&lt;span class="sb"&gt;`gemma4:e4b`&lt;/span&gt; (4 B parameters, the "efficient" Gemma 4 build) &lt;span class="gs"&gt;**excels at long-form natural-language synthesis from a 5 KB retrieved context**&lt;/span&gt; in my Graph-RAG project. It also &lt;span class="gs"&gt;**silently returns empty responses on five short meta-reasoning stages**&lt;/span&gt; — query rewrite, plan decomposition, web summary, self-critique, fact-check. Same model, same backend, same &lt;span class="sb"&gt;`task`&lt;/span&gt; parameter. Swapping to &lt;span class="sb"&gt;`gemma3:12b`&lt;/span&gt; made all five succeed without touching a single prompt.

I have data. I do not yet have a root cause. Posting this as a fair-witness field report in case other local-LLM operators have seen the same pattern (or have a prompt-side fix that doesn't require jumping to a 12 B model).

This is a companion to my earlier &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Build-track submission&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://dev.to/hashevolution/building-a-mini-palantir-on-gemma4e4b-128k-context-lets-the-graph-actually-be-graph-rag-33fk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;, which argued for E4B on the basis of its 128 K context window. That argument is still right — for the synthesis stage. The five meta stages are where the 4 B variant runs out of room.

&lt;span class="gu"&gt;## Setup (reproducible)&lt;/span&gt;

&lt;span class="gu"&gt;### Project&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;**PROJECT JAMES v0.3.x**&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://github.com/Hashevolution/James-RAG-Evol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; — a local-first Graph-RAG reasoning engine. MIT-licensed, Ollama-only, no cloud LLM dependency. v0.3.0 shipped the Cognitive Middleware Layer architecture; v0.3.x is landing its phases incrementally — verification engine, planner, tool router, query rewriter, fact-check.

Relevant stages of the cognitive layer:

| Stage | Purpose | Prompt shape |
|---|---|---|
| &lt;span class="sb"&gt;`query_rewrite`&lt;/span&gt; | Rewrite the user question for retrieval | Korean/English instruction → JSON &lt;span class="sb"&gt;`{"rewritten": "..."}`&lt;/span&gt; |
| &lt;span class="sb"&gt;`plan.decompose`&lt;/span&gt; | Break a multi-aspect question into ≤ 5 subtasks | Instruction → JSON &lt;span class="sb"&gt;`{"subtasks": [...]}`&lt;/span&gt; |
| &lt;span class="sb"&gt;`synth.rag`&lt;/span&gt; | The actual long-form answer | System prompt + retrieved context (~5 KB) + Korean question → Korean prose answer |
| &lt;span class="sb"&gt;`synth.web_summary`&lt;/span&gt; | Summarize fetched web results | Instruction + web snippets → short Korean summary |
| &lt;span class="sb"&gt;`reflect.critique`&lt;/span&gt; | Critique the draft answer | Draft + instruction → Korean critique text |
| &lt;span class="sb"&gt;`verify.fact_check`&lt;/span&gt; | Audit claims against source docs | Answer + sources + instruction → JSON &lt;span class="sb"&gt;`{"grounded": bool, "unsupported": [...]}`&lt;/span&gt; |

All stages route through one Ollama backend adapter and use the same &lt;span class="sb"&gt;`JAMES_LLM_MODEL`&lt;/span&gt; env var. Whatever model is named, every stage talks to it the same way.

&lt;span class="gu"&gt;### Environment&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; OS: Windows 11, PowerShell
&lt;span class="p"&gt;-&lt;/span&gt; Ollama: latest mid-May 2026 build
&lt;span class="p"&gt;-&lt;/span&gt; Models installed locally: &lt;span class="sb"&gt;`gemma4:e4b`&lt;/span&gt; (9.6 GB, ~4 B params), &lt;span class="sb"&gt;`gemma3:12b`&lt;/span&gt; (8.1 GB, ~12 B), plus a few others irrelevant to this report
&lt;span class="p"&gt;-&lt;/span&gt; All &lt;span class="sb"&gt;`JAMES_ENABLE_*`&lt;/span&gt; cognitive flags set to &lt;span class="sb"&gt;`1`&lt;/span&gt; in the same shell before launching the server

&lt;span class="gu"&gt;### Test query&lt;/span&gt;

&lt;span class="p"&gt;```&lt;/span&gt;&lt;span class="nl"&gt;
&lt;/span&gt;

BlackRock 과 Vanguard 의 ETF 전략 차이를 비교해줘


&lt;span class="p"&gt;```&lt;/span&gt;

A real Korean retrieval question. Intent classifier picked &lt;span class="sb"&gt;`retrieval`&lt;/span&gt; correctly. Document corpus contains ~10 finance documents matching the topic.

&lt;span class="gu"&gt;## What I observed with `gemma4:e4b`&lt;/span&gt;

Direct quote of the server console (one query, all stages enabled):

| Stage | LLM call type | Latency | Response size | Result |
|---|---|---|---|---|
| INTENT classify | &lt;span class="sb"&gt;`task=classify`&lt;/span&gt; | 9.1 s | &lt;span class="gs"&gt;**9 chars**&lt;/span&gt; ("retrieval") | ✅ OK |
| &lt;span class="sb"&gt;`query_rewrite`&lt;/span&gt; | &lt;span class="sb"&gt;`task=general`&lt;/span&gt; | 2.1 s | &lt;span class="gs"&gt;**0 chars**&lt;/span&gt; | ❌ empty |
| entity extract | &lt;span class="sb"&gt;`task=extract`&lt;/span&gt; | 9.5 s | &lt;span class="gs"&gt;**452 chars**&lt;/span&gt; (JSON of 9 entities) | ✅ OK |
| &lt;span class="sb"&gt;`synth.web_summary`&lt;/span&gt; | &lt;span class="sb"&gt;`task=general`&lt;/span&gt; | 4.0 s | &lt;span class="gs"&gt;**0 chars**&lt;/span&gt; | ❌ empty |
| &lt;span class="sb"&gt;`synth.rag`&lt;/span&gt; | &lt;span class="sb"&gt;`task=general`&lt;/span&gt; | 13.7 s | &lt;span class="gs"&gt;**2 690 chars**&lt;/span&gt; (Korean prose) | ✅ OK |
| &lt;span class="sb"&gt;`reflect.critique`&lt;/span&gt; | &lt;span class="sb"&gt;`task=general`&lt;/span&gt; | 4.2 s | &lt;span class="gs"&gt;**0 chars**&lt;/span&gt; | ❌ empty |
| &lt;span class="sb"&gt;`verify.fact_check`&lt;/span&gt; | &lt;span class="sb"&gt;`task=general`&lt;/span&gt; | 4.3 s | &lt;span class="gs"&gt;**0 chars**&lt;/span&gt; (prompt 4 319 → truncated to 4 000) | ❌ empty |

The empty-response path is taken when Ollama returns HTTP 200 but &lt;span class="sb"&gt;`response: ""`&lt;/span&gt; — the server replied successfully, the model just produced zero tokens. JAMES logs it as &lt;span class="sb"&gt;`gemma.empty_response`&lt;/span&gt;.

&lt;span class="gu"&gt;### What's striking&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**The 5 empty responses cluster at ~2–4 seconds.**&lt;/span&gt; Not a timeout. The per-stage budget is 10–30 s; the model decided it was done.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**The two successful `task=general` calls**&lt;/span&gt; (entity extract: JSON; synth.rag: long Korean prose) &lt;span class="gs"&gt;**took 9.5 s and 13.7 s.**&lt;/span&gt; Same backend, same model, same &lt;span class="sb"&gt;`task`&lt;/span&gt; parameter — only the prompt shape differs.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**The pattern is consistent across multiple trials.**&lt;/span&gt; Run the same query three times back-to-back and the same five stages are empty each time.

&lt;span class="gu"&gt;## Control — same prompts on `gemma3:12b`&lt;/span&gt;

Same query, same flags, no other changes. Single env-var swap, restart server:

| Stage | Latency | Response | Result |
|---|---|---|---|
| &lt;span class="sb"&gt;`query_rewrite`&lt;/span&gt; | 0.91 s | "BlackRock 및 Vanguard의 ETF 투자 전략과 포트폴리오 구성 방식의 차이점을 비교 분석해줘" — meaning-preserved keyword expansion | ✅ |
| &lt;span class="sb"&gt;`plan.decompose`&lt;/span&gt; | 1.33 s | 3 subtasks (BlackRock 조사 / Vanguard 조사 / 비교 분석) | ✅ |
| &lt;span class="sb"&gt;`synth.rag`&lt;/span&gt; | 9.6 s | 2 690-char Korean answer | ✅ |
| &lt;span class="sb"&gt;`reflect.critique`&lt;/span&gt; | 7.98 s | "## 답변 초안 비판적 검토 — 모순 / 사실 오류 …" — coherent meta-critique | ✅ |
| &lt;span class="sb"&gt;`reflect.revised`&lt;/span&gt; | 9.19 s | revised answer based on critique | ✅ |
| &lt;span class="sb"&gt;`verify.fact_check`&lt;/span&gt; | 1.17 s | &lt;span class="sb"&gt;`{"grounded": true, "unsupported": []}`&lt;/span&gt; — valid JSON | ✅ |

Full 9-step trace renders end-to-end. Wall-clock ~39 s. Same prompts. Same wiring. Same backend.

&lt;span class="gs"&gt;**This is the punchline: nothing changed except the model name.**&lt;/span&gt;

&lt;span class="gu"&gt;## Where Gemma 4 e4b still wins&lt;/span&gt;

Staying fair to the model:
&lt;span class="p"&gt;
-&lt;/span&gt; Long-form synthesis from a 5 KB retrieved context — the project's most-frequent stage — handled well at 13.7 s for 2 690 chars of genuinely useful Korean prose.
&lt;span class="p"&gt;-&lt;/span&gt; JSON entity extraction with a 9-entity schema returned 452 chars of clean JSON at 9.5 s.
&lt;span class="p"&gt;-&lt;/span&gt; Single-token classification — emit exactly one of seven mode strings — was fine.

The model is not "broken." It ships real Graph-RAG answers. The narrow failure mode is a second class of prompts: &lt;span class="gs"&gt;**short, structured, meta-instructional**&lt;/span&gt;.

&lt;span class="gu"&gt;## The failure pattern&lt;/span&gt;

&lt;span class="p"&gt;```&lt;/span&gt;&lt;span class="nl"&gt;
&lt;/span&gt;

✅ succeeds    long context + free-form Korean prose
✅ succeeds    short instruction + emit 1 token from a finite vocab
✅ succeeds    rich context + emit one JSON object describing the input
❌ empty       short context + emit JSON that critiques / restructures / audits the input


&lt;span class="p"&gt;```&lt;/span&gt;

The five empty responses share three traits:
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**The model is asked to act on a model output**&lt;/span&gt; — rewrite the user query, critique a draft, audit claims.
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**The expected output is short and structured**&lt;/span&gt; — a few sentences, or a tight JSON object.
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**The prompt mixes Korean instructions with English JSON schema keys**&lt;/span&gt; — e.g. &lt;span class="sb"&gt;`{"rewritten": "..."}`&lt;/span&gt; or &lt;span class="sb"&gt;`{"grounded": true, "unsupported": []}`&lt;/span&gt;.

A natural-language paraphrase (synth.rag) avoids all three. A JSON entity extraction has trait 3 only, and that one passes. The cluster of all three is what seems to silence the model.

&lt;span class="gu"&gt;## Four working hypotheses&lt;/span&gt;

I have data but not a root cause. Four candidate explanations, listed by my own subjective likelihood:

&lt;span class="gu"&gt;### A. Meta-reasoning capacity at 4 B is the floor&lt;/span&gt;

Critique / verify / decomposition prompts ask the model to reason &lt;span class="ge"&gt;*about*&lt;/span&gt; another reasoning artifact. The empirical literature on small open-weights models (Qwen 2.5-3B, Phi-3-mini, Gemma-2-2B, …) consistently shows the meta-reasoning gap is the first capability to drop below ~7 B params, while paraphrase-from-context survives much smaller. If this is right, no prompt-side fix exists for E4B on these stages.

&lt;span class="gu"&gt;### B. Early stop-token emission on short structured prompts&lt;/span&gt;

Ollama returning &lt;span class="sb"&gt;`response: ""`&lt;/span&gt; on a 2–4 s call (well below the timeout) is consistent with the model emitting EOS / &lt;span class="sb"&gt;`&amp;lt;end_of_turn&amp;gt;`&lt;/span&gt; immediately. Possibly the chat template wrapping resembles a completed conversation when the user prompt itself looks like an instruction-only frame with no input data attached.

&lt;span class="gu"&gt;### C. Korean instruction + English JSON schema confusion&lt;/span&gt;

The five failing prompts all mix Korean directive language with English-key JSON output. The two succeeding &lt;span class="sb"&gt;`task=general`&lt;/span&gt; calls don't (entity extract uses Korean prompt → Korean-content JSON; synth.rag is all Korean). Worth testing whether an all-Korean schema (e.g. &lt;span class="sb"&gt;`{"재작성된_질의": "..."}`&lt;/span&gt;) would change anything.

&lt;span class="gu"&gt;### D. JAMES-side prompt-truncation artifact&lt;/span&gt;

The &lt;span class="sb"&gt;`verify.fact_check`&lt;/span&gt; log shows &lt;span class="sb"&gt;`prompt 4 319자 → 4 000자 축약`&lt;/span&gt; — JAMES capped the prompt at 4 000 chars, which likely chopped the closing brace of an embedded JSON example in the system prompt. If true, this is a JAMES bug, not a Gemma 4 bug — but it would only explain &lt;span class="sb"&gt;`verify.fact_check`&lt;/span&gt;, not the other four empty responses.

The report explicitly &lt;span class="gs"&gt;**does not**&lt;/span&gt; advocate for a single hypothesis — that is the work this feedback round is asking the community to fund.

&lt;span class="gu"&gt;## What I'd love feedback on&lt;/span&gt;

If you've used &lt;span class="sb"&gt;`gemma4:e4b`&lt;/span&gt; (or &lt;span class="sb"&gt;`gemma4:e2b`&lt;/span&gt;) and have data points either way, I'd like to know:
&lt;span class="p"&gt;
1.&lt;/span&gt; Have you seen the same "empty response on short structured prompts" pattern? Especially critique-of-a-draft, JSON schema audit, query rewrite.
&lt;span class="p"&gt;2.&lt;/span&gt; Did a prompt-engineering change rescue it on your setup? Different chat template, different &lt;span class="sb"&gt;`num_predict`&lt;/span&gt;, different temperature, all-one-language prompts, anything else.
&lt;span class="p"&gt;3.&lt;/span&gt; Does &lt;span class="sb"&gt;`gemma4:e2b`&lt;/span&gt; show the same pattern, or is it specific to E4B?
&lt;span class="p"&gt;4.&lt;/span&gt; Does the same prompt set behave on &lt;span class="sb"&gt;`gemma4:31b-dense`&lt;/span&gt; / &lt;span class="sb"&gt;`gemma4:26b-moe`&lt;/span&gt; if you have one of those provisioned?
&lt;span class="p"&gt;5.&lt;/span&gt; Is there a known issue with Ollama + Gemma 4 + JSON-output prompts in your experience?

Project's stance on next steps:
&lt;span class="p"&gt;
-&lt;/span&gt; Default model swap to &lt;span class="sb"&gt;`gemma3:12b`&lt;/span&gt; is already done locally. &lt;span class="sb"&gt;`gemma4:e4b`&lt;/span&gt; stays available — its long-context synthesis is the project's bread-and-butter stage.
&lt;span class="p"&gt;-&lt;/span&gt; A follow-up PR (option A2) will let operators wire individual cognitive stages to different backends, so E4B can keep &lt;span class="sb"&gt;`synth.rag`&lt;/span&gt; while a heavier model takes the meta stages.
&lt;span class="p"&gt;-&lt;/span&gt; We will &lt;span class="gs"&gt;**not**&lt;/span&gt; patch JAMES's prompt shapes specifically to coax E4B into responding on these stages until we understand whether the empty response is the model declining, the chat template misfiring, or a JAMES-side truncation bug.

&lt;span class="gu"&gt;## Reproduction&lt;/span&gt;

If you want to reproduce — or, more usefully, to falsify — the report on your own corpus:

&lt;span class="p"&gt;```&lt;/span&gt;&lt;span class="nl"&gt;
&lt;/span&gt;
powershell
# 1. Install JAMES (one-liner, MIT, no cloud)
git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
python -m pip install -r requirements.txt

# 2. Make sure the two models are local
ollama pull gemma4:e4b
ollama pull gemma3:12b

# 3. Enable the five cognitive stages
$env:JAMES_ENABLE_QUERY_REWRITE = "1"
$env:JAMES_ENABLE_PLANNER       = "1"
$env:JAMES_ENABLE_REFLECT       = "1"
$env:JAMES_ENABLE_VERIFY        = "1"
$env:JAMES_ENABLE_FACT_CHECK    = "1"

# 4. Test with Gemma 4
$env:JAMES_LLM_MODEL = "gemma4:e4b"
python server_llmwiki.py
# In another shell, send a retrieval query, e.g. the same BlackRock vs Vanguard line above.
python scripts/replay_trace.py --recent

# 5. Control: Gemma 3
$env:JAMES_LLM_MODEL = "gemma3:12b"
python server_llmwiki.py
# Same query, same trace command — all 9 stages succeed


&lt;span class="p"&gt;```&lt;/span&gt;

If you publish your own numbers — X / GitHub issue / Reddit / dev.to comment — please tag &lt;span class="sb"&gt;`#JAMES`&lt;/span&gt; or open an issue on the &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://github.com/Hashevolution/James-RAG-Evol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;. I'll link it back to this report.

&lt;span class="gu"&gt;## A note on the companion piece&lt;/span&gt;

This Write-track submission and the &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Build-track submission&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://dev.to/hashevolution/building-a-mini-palantir-on-gemma4e4b-128k-context-lets-the-graph-actually-be-graph-rag-33fk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; are deliberately contradictory in tone — one defends the model choice, the other documents where the same model falls short on a different class of prompts. Both are honest readings of the same model under different conditions. I think the contradiction is the point: writing about Gemma 4 useful for the community has to include both halves, not just the half that fits the marketing arc.

If you've read &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Ali Afana's parallel piece on MoE vs Dense&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://dev.to/alimafana/i-added-three-rules-to-gemma-4-the-moe-searched-the-dense-model-refused-1j18&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;, you'll recognize the framing: same prompt, opposite behavior, architecture under the model is the variable I wasn't controlling. He came at it from MoE vs Dense; I came at it from 4 B vs 12 B and meta-task vs synthesis-task. The two reports compose.
&lt;span class="p"&gt;
---
&lt;/span&gt;
🤖 &lt;span class="ge"&gt;*Honest disclosure: this submission was drafted with AI assistance and edited by the author. The trace numbers, environment specs, and reproduction commands are real and verifiable in the linked repository. The hypotheses are the author's; the fair-witness framing — data without root cause — is deliberate.*&lt;/span&gt;
&lt;span class="p"&gt;```&lt;/span&gt;&lt;span class="nl"&gt;
&lt;/span&gt;
`

---

## Where to publish

dev.to → New Post → Editor v1 (markdown) → paste the body above → set title, tags, cover image → **Publish**.

After publish:

1. Add the URL to `reports/promo-assets/launch-tracker.md` "Social posts" table (or trigger a small docs PR — happy to handle this from a future session).
2. Add a self-reply comment on the article pointing at:
   - The internal eval report ([`gemma4-e4b-cognitive-stages-eval.md`](./gemma4-e4b-cognitive-stages-eval.md)) — the source of truth.
   - The Build-track submission — completes the "two halves, same author" arc.
   - Ali Afana's parallel piece — extends the conversation across two writers.
3. Quote-reply from the existing X English thread + LinkedIn post linking the new article. Image: `06-3d-graph.jpg` again (the hero), or `03-chat-graph-paths.jpg` if the post wants to lead with the chat UI.

## Why this submission can win the Write track

| Rubric criterion | This piece |
|---|---|
| **Clarity and depth of explanation** | One controlled experiment, six tabulated trace rows, four named hypotheses, explicit reproduction script |
| **Originality of perspective or insight** | Fair-witness framing — "I have data, not a conclusion" — is rare in dev.to LLM writing. Most pieces commit to a hypothesis early |
| **Practical value to the community** | The five open questions are answerable by anyone running Gemma 4 + Ollama. Any single reply with falsifying data is useful project-wide |
| **Quality of writing** | Inherited from the eval report's voice — short paragraphs, tight tables, no flourish |

Combined with the Build-track piece, the same author appears twice on the challenge with two non-overlapping perspectives on the same model. That itself is a signal of seriousness — defending a model in one piece and documenting its limits in the other is the opposite of a marketing arc.

## Risk-management notes

- The piece is honest about a failure mode of Gemma 4. It is *not* a hit piece — it explicitly preserves credit for what the model does well, and frames the failure as "rich call for community data" rather than "model is bad." This tone is the actual differentiator.
- The mention of Korean text in failed prompts could be misread as a language-equity issue. The body explicitly frames Hypothesis C as one of four possibilities and proposes the test (Korean-key JSON) — that is the right shape for the claim, not bigger.
- Title A leads with five numbers. If dev.to's automatic linting flags it, B or C are safe fallbacks.

## Companion artifacts

- Source eval report (definitive numbers): [`gemma4-e4b-cognitive-stages-eval.md`](./gemma4-e4b-cognitive-stages-eval.md)
- Feedback-routing handover (what to do when replies arrive): [`docs/handovers/v0.3.x-gemma4-feedback-track.md`](../../docs/handovers/v0.3.x-gemma4-feedback-track.md)
- Build-track Companion: [`devto-gemma4-challenge.md`](./devto-gemma4-challenge.md)
- Visual library for cover / inline images: [`screenshots/README.md`](./screenshots/README.md)
- Launch tracker (running log): [`launch-tracker.md`](./launch-tracker.md)# dev.to — Gemma 4 Challenge submission (Write track)

&amp;gt; Drafted: 2026-05-18
&amp;gt; Track: **Write about Gemma 4** ($100 × 5 winners)
&amp;gt; Source material: [`gemma4-e4b-cognitive-stages-eval.md`](./gemma4-e4b-cognitive-stages-eval.md) — internal fair-witness report (PR #307)
&amp;gt; Companion submission: [Build track piece on E4B model choice](./devto-gemma4-challenge.md)
&amp;gt; Submission deadline: 2026-05-24 23:59 PDT
&amp;gt; Winners announced: 2026-06-04
&amp;gt; Tags: `devchallenge`, `gemmachallenge`, `gemma`, `ollama`

## Why a Write-track submission in addition to the Build-track one

The Build-track submission ([`Building a Mini Palantir on gemma4:e4b — 128K Context Lets the Graph Actually Be Graph-RAG`](https://dev.to/hashevolution/building-a-mini-palantir-on-gemma4e4b-128k-context-lets-the-graph-actually-be-graph-rag-33fk)) made an *intentional model choice* argument: 128K context window &amp;gt; parameter count, so E4B was right for the Graph-RAG retrieval-conditioning stage.

That argument held — and it is still the strongest single thing E4B does in this project.

But once the v0.3 Cognitive Middleware Layer started shipping Phase 2 stages (verification, planner, tool router, query rewriter, fact-check), a second pattern showed up that the Build-track piece could not honestly absorb: E4B silently returns empty responses on five of the nine cognitive stages, while the same prompts on Gemma 3 12B succeed end-to-end.

This is the Write-track piece that documents that pattern honestly, without retreating from the Build-track claim. Same author, two articles, two facets of the same model.

The challenge judging rubric for the Write track is:

- Clarity and depth of explanation
- Originality of perspective or insight
- Practical value to the community
- Quality of writing

A fair-witness field report meets all four at once: it shares reproducible numbers, an explicit "I don't know yet" stance on root cause, and a set of open questions that other operators can act on.

---

## Suggested title (pick one)

| # | Title | Why |
|---|---|---|
| **A** | **5 empty responses from `gemma4:e4b`. 4 hypotheses. 0 root cause.** A fair-witness field report from a Graph-RAG production. | ⭐ Strongest hook — number-led, names a tension (no resolution), promises honesty. |
| B | Where Gemma 4 e4b runs out of room: empty responses on meta-reasoning stages | Clearer technical framing, slightly less click-worthy |
| C | Gemma 4 e4b: brilliant at synthesis, silent on meta-reasoning. A field report. | Bridges the strengths and weaknesses in the title itself |

Recommended: **A**. Numbers-led titles outperform on the dev.to feed; the "0 root cause" half signals the writing is honest rather than gloating.

## Cover image

Use [`reports/promo-assets/screenshots/03-chat-graph-paths.jpg`](./screenshots/03-chat-graph-paths.jpg) — the chat-UI screenshot with "그래프 경로 47개 보기" surfaced. It primes the reader for "this writer ships a real Graph-RAG pipeline" before the article gets into the failure mode.

Download URL: `https://github.com/Hashevolution/James-RAG-Evol/blob/main/reports/promo-assets/screenshots/03-chat-graph-paths.jpg?raw=true`

## Tags

```plaintext
devchallenge, gemmachallenge, gemma, ollama
```

`ollama` is the 4th tag (instead of `rag`) — the failure mode plausibly involves the Ollama chat template or stop-token handling, so the Ollama tag's audience is more likely to recognize the pattern.

---

# Submission body (copy-paste into dev.to editor)

`

&lt;span class="p"&gt;```&lt;/span&gt;markdown
&lt;span class="ge"&gt;*This is a submission for the [Gemma 4 Challenge: Write about Gemma 4](https://dev.to/challenges/google-gemma-2026-05-06)*&lt;/span&gt;

&lt;span class="gu"&gt;## TL;DR&lt;/span&gt;

&lt;span class="sb"&gt;`gemma4:e4b`&lt;/span&gt; (4 B parameters, the "efficient" Gemma 4 build) &lt;span class="gs"&gt;**excels at long-form natural-language synthesis from a 5 KB retrieved context**&lt;/span&gt; in my Graph-RAG project. It also &lt;span class="gs"&gt;**silently returns empty responses on five short meta-reasoning stages**&lt;/span&gt; — query rewrite, plan decomposition, web summary, self-critique, fact-check. Same model, same backend, same &lt;span class="sb"&gt;`task`&lt;/span&gt; parameter. Swapping to &lt;span class="sb"&gt;`gemma3:12b`&lt;/span&gt; made all five succeed without touching a single prompt.

I have data. I do not yet have a root cause. Posting this as a fair-witness field report in case other local-LLM operators have seen the same pattern (or have a prompt-side fix that doesn't require jumping to a 12 B model).

This is a companion to my earlier &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Build-track submission&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://dev.to/hashevolution/building-a-mini-palantir-on-gemma4e4b-128k-context-lets-the-graph-actually-be-graph-rag-33fk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;, which argued for E4B on the basis of its 128 K context window. That argument is still right — for the synthesis stage. The five meta stages are where the 4 B variant runs out of room.

&lt;span class="gu"&gt;## Setup (reproducible)&lt;/span&gt;

&lt;span class="gu"&gt;### Project&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;**PROJECT JAMES v0.3.x**&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://github.com/Hashevolution/James-RAG-Evol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; — a local-first Graph-RAG reasoning engine. MIT-licensed, Ollama-only, no cloud LLM dependency. v0.3.0 shipped the Cognitive Middleware Layer architecture; v0.3.x is landing its phases incrementally — verification engine, planner, tool router, query rewriter, fact-check.

Relevant stages of the cognitive layer:

| Stage | Purpose | Prompt shape |
|---|---|---|
| &lt;span class="sb"&gt;`query_rewrite`&lt;/span&gt; | Rewrite the user question for retrieval | Korean/English instruction → JSON &lt;span class="sb"&gt;`{"rewritten": "..."}`&lt;/span&gt; |
| &lt;span class="sb"&gt;`plan.decompose`&lt;/span&gt; | Break a multi-aspect question into ≤ 5 subtasks | Instruction → JSON &lt;span class="sb"&gt;`{"subtasks": [...]}`&lt;/span&gt; |
| &lt;span class="sb"&gt;`synth.rag`&lt;/span&gt; | The actual long-form answer | System prompt + retrieved context (~5 KB) + Korean question → Korean prose answer |
| &lt;span class="sb"&gt;`synth.web_summary`&lt;/span&gt; | Summarize fetched web results | Instruction + web snippets → short Korean summary |
| &lt;span class="sb"&gt;`reflect.critique`&lt;/span&gt; | Critique the draft answer | Draft + instruction → Korean critique text |
| &lt;span class="sb"&gt;`verify.fact_check`&lt;/span&gt; | Audit claims against source docs | Answer + sources + instruction → JSON &lt;span class="sb"&gt;`{"grounded": bool, "unsupported": [...]}`&lt;/span&gt; |

All stages route through one Ollama backend adapter and use the same &lt;span class="sb"&gt;`JAMES_LLM_MODEL`&lt;/span&gt; env var. Whatever model is named, every stage talks to it the same way.

&lt;span class="gu"&gt;### Environment&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; OS: Windows 11, PowerShell
&lt;span class="p"&gt;-&lt;/span&gt; Ollama: latest mid-May 2026 build
&lt;span class="p"&gt;-&lt;/span&gt; Models installed locally: &lt;span class="sb"&gt;`gemma4:e4b`&lt;/span&gt; (9.6 GB, ~4 B params), &lt;span class="sb"&gt;`gemma3:12b`&lt;/span&gt; (8.1 GB, ~12 B), plus a few others irrelevant to this report
&lt;span class="p"&gt;-&lt;/span&gt; All &lt;span class="sb"&gt;`JAMES_ENABLE_*`&lt;/span&gt; cognitive flags set to &lt;span class="sb"&gt;`1`&lt;/span&gt; in the same shell before launching the server

&lt;span class="gu"&gt;### Test query&lt;/span&gt;

&lt;span class="p"&gt;```&lt;/span&gt;&lt;span class="nl"&gt;
&lt;/span&gt;

BlackRock 과 Vanguard 의 ETF 전략 차이를 비교해줘


&lt;span class="p"&gt;```&lt;/span&gt;

A real Korean retrieval question. Intent classifier picked &lt;span class="sb"&gt;`retrieval`&lt;/span&gt; correctly. Document corpus contains ~10 finance documents matching the topic.

&lt;span class="gu"&gt;## What I observed with `gemma4:e4b`&lt;/span&gt;

Direct quote of the server console (one query, all stages enabled):

| Stage | LLM call type | Latency | Response size | Result |
|---|---|---|---|---|
| INTENT classify | &lt;span class="sb"&gt;`task=classify`&lt;/span&gt; | 9.1 s | &lt;span class="gs"&gt;**9 chars**&lt;/span&gt; ("retrieval") | ✅ OK |
| &lt;span class="sb"&gt;`query_rewrite`&lt;/span&gt; | &lt;span class="sb"&gt;`task=general`&lt;/span&gt; | 2.1 s | &lt;span class="gs"&gt;**0 chars**&lt;/span&gt; | ❌ empty |
| entity extract | &lt;span class="sb"&gt;`task=extract`&lt;/span&gt; | 9.5 s | &lt;span class="gs"&gt;**452 chars**&lt;/span&gt; (JSON of 9 entities) | ✅ OK |
| &lt;span class="sb"&gt;`synth.web_summary`&lt;/span&gt; | &lt;span class="sb"&gt;`task=general`&lt;/span&gt; | 4.0 s | &lt;span class="gs"&gt;**0 chars**&lt;/span&gt; | ❌ empty |
| &lt;span class="sb"&gt;`synth.rag`&lt;/span&gt; | &lt;span class="sb"&gt;`task=general`&lt;/span&gt; | 13.7 s | &lt;span class="gs"&gt;**2 690 chars**&lt;/span&gt; (Korean prose) | ✅ OK |
| &lt;span class="sb"&gt;`reflect.critique`&lt;/span&gt; | &lt;span class="sb"&gt;`task=general`&lt;/span&gt; | 4.2 s | &lt;span class="gs"&gt;**0 chars**&lt;/span&gt; | ❌ empty |
| &lt;span class="sb"&gt;`verify.fact_check`&lt;/span&gt; | &lt;span class="sb"&gt;`task=general`&lt;/span&gt; | 4.3 s | &lt;span class="gs"&gt;**0 chars**&lt;/span&gt; (prompt 4 319 → truncated to 4 000) | ❌ empty |

The empty-response path is taken when Ollama returns HTTP 200 but &lt;span class="sb"&gt;`response: ""`&lt;/span&gt; — the server replied successfully, the model just produced zero tokens. JAMES logs it as &lt;span class="sb"&gt;`gemma.empty_response`&lt;/span&gt;.

&lt;span class="gu"&gt;### What's striking&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**The 5 empty responses cluster at ~2–4 seconds.**&lt;/span&gt; Not a timeout. The per-stage budget is 10–30 s; the model decided it was done.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**The two successful `task=general` calls**&lt;/span&gt; (entity extract: JSON; synth.rag: long Korean prose) &lt;span class="gs"&gt;**took 9.5 s and 13.7 s.**&lt;/span&gt; Same backend, same model, same &lt;span class="sb"&gt;`task`&lt;/span&gt; parameter — only the prompt shape differs.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**The pattern is consistent across multiple trials.**&lt;/span&gt; Run the same query three times back-to-back and the same five stages are empty each time.

&lt;span class="gu"&gt;## Control — same prompts on `gemma3:12b`&lt;/span&gt;

Same query, same flags, no other changes. Single env-var swap, restart server:

| Stage | Latency | Response | Result |
|---|---|---|---|
| &lt;span class="sb"&gt;`query_rewrite`&lt;/span&gt; | 0.91 s | "BlackRock 및 Vanguard의 ETF 투자 전략과 포트폴리오 구성 방식의 차이점을 비교 분석해줘" — meaning-preserved keyword expansion | ✅ |
| &lt;span class="sb"&gt;`plan.decompose`&lt;/span&gt; | 1.33 s | 3 subtasks (BlackRock 조사 / Vanguard 조사 / 비교 분석) | ✅ |
| &lt;span class="sb"&gt;`synth.rag`&lt;/span&gt; | 9.6 s | 2 690-char Korean answer | ✅ |
| &lt;span class="sb"&gt;`reflect.critique`&lt;/span&gt; | 7.98 s | "## 답변 초안 비판적 검토 — 모순 / 사실 오류 …" — coherent meta-critique | ✅ |
| &lt;span class="sb"&gt;`reflect.revised`&lt;/span&gt; | 9.19 s | revised answer based on critique | ✅ |
| &lt;span class="sb"&gt;`verify.fact_check`&lt;/span&gt; | 1.17 s | &lt;span class="sb"&gt;`{"grounded": true, "unsupported": []}`&lt;/span&gt; — valid JSON | ✅ |

Full 9-step trace renders end-to-end. Wall-clock ~39 s. Same prompts. Same wiring. Same backend.

&lt;span class="gs"&gt;**This is the punchline: nothing changed except the model name.**&lt;/span&gt;

&lt;span class="gu"&gt;## Where Gemma 4 e4b still wins&lt;/span&gt;

Staying fair to the model:
&lt;span class="p"&gt;
-&lt;/span&gt; Long-form synthesis from a 5 KB retrieved context — the project's most-frequent stage — handled well at 13.7 s for 2 690 chars of genuinely useful Korean prose.
&lt;span class="p"&gt;-&lt;/span&gt; JSON entity extraction with a 9-entity schema returned 452 chars of clean JSON at 9.5 s.
&lt;span class="p"&gt;-&lt;/span&gt; Single-token classification — emit exactly one of seven mode strings — was fine.

The model is not "broken." It ships real Graph-RAG answers. The narrow failure mode is a second class of prompts: &lt;span class="gs"&gt;**short, structured, meta-instructional**&lt;/span&gt;.

&lt;span class="gu"&gt;## The failure pattern&lt;/span&gt;

&lt;span class="p"&gt;```&lt;/span&gt;&lt;span class="nl"&gt;
&lt;/span&gt;

✅ succeeds    long context + free-form Korean prose
✅ succeeds    short instruction + emit 1 token from a finite vocab
✅ succeeds    rich context + emit one JSON object describing the input
❌ empty       short context + emit JSON that critiques / restructures / audits the input


&lt;span class="p"&gt;```&lt;/span&gt;

The five empty responses share three traits:
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**The model is asked to act on a model output**&lt;/span&gt; — rewrite the user query, critique a draft, audit claims.
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**The expected output is short and structured**&lt;/span&gt; — a few sentences, or a tight JSON object.
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**The prompt mixes Korean instructions with English JSON schema keys**&lt;/span&gt; — e.g. &lt;span class="sb"&gt;`{"rewritten": "..."}`&lt;/span&gt; or &lt;span class="sb"&gt;`{"grounded": true, "unsupported": []}`&lt;/span&gt;.

A natural-language paraphrase (synth.rag) avoids all three. A JSON entity extraction has trait 3 only, and that one passes. The cluster of all three is what seems to silence the model.

&lt;span class="gu"&gt;## Four working hypotheses&lt;/span&gt;

I have data but not a root cause. Four candidate explanations, listed by my own subjective likelihood:

&lt;span class="gu"&gt;### A. Meta-reasoning capacity at 4 B is the floor&lt;/span&gt;

Critique / verify / decomposition prompts ask the model to reason &lt;span class="ge"&gt;*about*&lt;/span&gt; another reasoning artifact. The empirical literature on small open-weights models (Qwen 2.5-3B, Phi-3-mini, Gemma-2-2B, …) consistently shows the meta-reasoning gap is the first capability to drop below ~7 B params, while paraphrase-from-context survives much smaller. If this is right, no prompt-side fix exists for E4B on these stages.

&lt;span class="gu"&gt;### B. Early stop-token emission on short structured prompts&lt;/span&gt;

Ollama returning &lt;span class="sb"&gt;`response: ""`&lt;/span&gt; on a 2–4 s call (well below the timeout) is consistent with the model emitting EOS / &lt;span class="sb"&gt;`&amp;lt;end_of_turn&amp;gt;`&lt;/span&gt; immediately. Possibly the chat template wrapping resembles a completed conversation when the user prompt itself looks like an instruction-only frame with no input data attached.

&lt;span class="gu"&gt;### C. Korean instruction + English JSON schema confusion&lt;/span&gt;

The five failing prompts all mix Korean directive language with English-key JSON output. The two succeeding &lt;span class="sb"&gt;`task=general`&lt;/span&gt; calls don't (entity extract uses Korean prompt → Korean-content JSON; synth.rag is all Korean). Worth testing whether an all-Korean schema (e.g. &lt;span class="sb"&gt;`{"재작성된_질의": "..."}`&lt;/span&gt;) would change anything.

&lt;span class="gu"&gt;### D. JAMES-side prompt-truncation artifact&lt;/span&gt;

The &lt;span class="sb"&gt;`verify.fact_check`&lt;/span&gt; log shows &lt;span class="sb"&gt;`prompt 4 319자 → 4 000자 축약`&lt;/span&gt; — JAMES capped the prompt at 4 000 chars, which likely chopped the closing brace of an embedded JSON example in the system prompt. If true, this is a JAMES bug, not a Gemma 4 bug — but it would only explain &lt;span class="sb"&gt;`verify.fact_check`&lt;/span&gt;, not the other four empty responses.

The report explicitly &lt;span class="gs"&gt;**does not**&lt;/span&gt; advocate for a single hypothesis — that is the work this feedback round is asking the community to fund.

&lt;span class="gu"&gt;## What I'd love feedback on&lt;/span&gt;

If you've used &lt;span class="sb"&gt;`gemma4:e4b`&lt;/span&gt; (or &lt;span class="sb"&gt;`gemma4:e2b`&lt;/span&gt;) and have data points either way, I'd like to know:
&lt;span class="p"&gt;
1.&lt;/span&gt; Have you seen the same "empty response on short structured prompts" pattern? Especially critique-of-a-draft, JSON schema audit, query rewrite.
&lt;span class="p"&gt;2.&lt;/span&gt; Did a prompt-engineering change rescue it on your setup? Different chat template, different &lt;span class="sb"&gt;`num_predict`&lt;/span&gt;, different temperature, all-one-language prompts, anything else.
&lt;span class="p"&gt;3.&lt;/span&gt; Does &lt;span class="sb"&gt;`gemma4:e2b`&lt;/span&gt; show the same pattern, or is it specific to E4B?
&lt;span class="p"&gt;4.&lt;/span&gt; Does the same prompt set behave on &lt;span class="sb"&gt;`gemma4:31b-dense`&lt;/span&gt; / &lt;span class="sb"&gt;`gemma4:26b-moe`&lt;/span&gt; if you have one of those provisioned?
&lt;span class="p"&gt;5.&lt;/span&gt; Is there a known issue with Ollama + Gemma 4 + JSON-output prompts in your experience?

Project's stance on next steps:
&lt;span class="p"&gt;
-&lt;/span&gt; Default model swap to &lt;span class="sb"&gt;`gemma3:12b`&lt;/span&gt; is already done locally. &lt;span class="sb"&gt;`gemma4:e4b`&lt;/span&gt; stays available — its long-context synthesis is the project's bread-and-butter stage.
&lt;span class="p"&gt;-&lt;/span&gt; A follow-up PR (option A2) will let operators wire individual cognitive stages to different backends, so E4B can keep &lt;span class="sb"&gt;`synth.rag`&lt;/span&gt; while a heavier model takes the meta stages.
&lt;span class="p"&gt;-&lt;/span&gt; We will &lt;span class="gs"&gt;**not**&lt;/span&gt; patch JAMES's prompt shapes specifically to coax E4B into responding on these stages until we understand whether the empty response is the model declining, the chat template misfiring, or a JAMES-side truncation bug.

&lt;span class="gu"&gt;## Reproduction&lt;/span&gt;

If you want to reproduce — or, more usefully, to falsify — the report on your own corpus:

&lt;span class="p"&gt;```&lt;/span&gt;&lt;span class="nl"&gt;
&lt;/span&gt;
powershell
# 1. Install JAMES (one-liner, MIT, no cloud)
git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
python -m pip install -r requirements.txt

# 2. Make sure the two models are local
ollama pull gemma4:e4b
ollama pull gemma3:12b

# 3. Enable the five cognitive stages
$env:JAMES_ENABLE_QUERY_REWRITE = "1"
$env:JAMES_ENABLE_PLANNER       = "1"
$env:JAMES_ENABLE_REFLECT       = "1"
$env:JAMES_ENABLE_VERIFY        = "1"
$env:JAMES_ENABLE_FACT_CHECK    = "1"

# 4. Test with Gemma 4
$env:JAMES_LLM_MODEL = "gemma4:e4b"
python server_llmwiki.py
# In another shell, send a retrieval query, e.g. the same BlackRock vs Vanguard line above.
python scripts/replay_trace.py --recent

# 5. Control: Gemma 3
$env:JAMES_LLM_MODEL = "gemma3:12b"
python server_llmwiki.py
# Same query, same trace command — all 9 stages succeed


&lt;span class="p"&gt;```&lt;/span&gt;

If you publish your own numbers — X / GitHub issue / Reddit / dev.to comment — please tag &lt;span class="sb"&gt;`#JAMES`&lt;/span&gt; or open an issue on the &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://github.com/Hashevolution/James-RAG-Evol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;. I'll link it back to this report.

&lt;span class="gu"&gt;## A note on the companion piece&lt;/span&gt;

This Write-track submission and the &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Build-track submission&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://dev.to/hashevolution/building-a-mini-palantir-on-gemma4e4b-128k-context-lets-the-graph-actually-be-graph-rag-33fk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; are deliberately contradictory in tone — one defends the model choice, the other documents where the same model falls short on a different class of prompts. Both are honest readings of the same model under different conditions. I think the contradiction is the point: writing about Gemma 4 useful for the community has to include both halves, not just the half that fits the marketing arc.

If you've read &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Ali Afana's parallel piece on MoE vs Dense&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://dev.to/alimafana/i-added-three-rules-to-gemma-4-the-moe-searched-the-dense-model-refused-1j18&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;, you'll recognize the framing: same prompt, opposite behavior, architecture under the model is the variable I wasn't controlling. He came at it from MoE vs Dense; I came at it from 4 B vs 12 B and meta-task vs synthesis-task. The two reports compose.
&lt;span class="p"&gt;
---
&lt;/span&gt;
🤖 &lt;span class="ge"&gt;*Honest disclosure: this submission was drafted with AI assistance and edited by the author. The trace numbers, environment specs, and reproduction commands are real and verifiable in the linked repository. The hypotheses are the author's; the fair-witness framing — data without root cause — is deliberate.*&lt;/span&gt;
&lt;span class="p"&gt;```&lt;/span&gt;&lt;span class="nl"&gt;
&lt;/span&gt;
`

---

## Where to publish

dev.to → New Post → Editor v1 (markdown) → paste the body above → set title, tags, cover image → **Publish**.

After publish:

1. Add the URL to `reports/promo-assets/launch-tracker.md` "Social posts" table (or trigger a small docs PR — happy to handle this from a future session).
2. Add a self-reply comment on the article pointing at:
   - The internal eval report ([`gemma4-e4b-cognitive-stages-eval.md`](./gemma4-e4b-cognitive-stages-eval.md)) — the source of truth.
   - The Build-track submission — completes the "two halves, same author" arc.
   - Ali Afana's parallel piece — extends the conversation across two writers.
3. Quote-reply from the existing X English thread + LinkedIn post linking the new article. Image: `06-3d-graph.jpg` again (the hero), or `03-chat-graph-paths.jpg` if the post wants to lead with the chat UI.

## Why this submission can win the Write track

| Rubric criterion | This piece |
|---|---|
| **Clarity and depth of explanation** | One controlled experiment, six tabulated trace rows, four named hypotheses, explicit reproduction script |
| **Originality of perspective or insight** | Fair-witness framing — "I have data, not a conclusion" — is rare in dev.to LLM writing. Most pieces commit to a hypothesis early |
| **Practical value to the community** | The five open questions are answerable by anyone running Gemma 4 + Ollama. Any single reply with falsifying data is useful project-wide |
| **Quality of writing** | Inherited from the eval report's voice — short paragraphs, tight tables, no flourish |

Combined with the Build-track piece, the same author appears twice on the challenge with two non-overlapping perspectives on the same model. That itself is a signal of seriousness — defending a model in one piece and documenting its limits in the other is the opposite of a marketing arc.

## Risk-management notes

- The piece is honest about a failure mode of Gemma 4. It is *not* a hit piece — it explicitly preserves credit for what the model does well, and frames the failure as "rich call for community data" rather than "model is bad." This tone is the actual differentiator.
- The mention of Korean text in failed prompts could be misread as a language-equity issue. The body explicitly frames Hypothesis C as one of four possibilities and proposes the test (Korean-key JSON) — that is the right shape for the claim, not bigger.
- Title A leads with five numbers. If dev.to's automatic linting flags it, B or C are safe fallbacks.

## Companion artifacts

- Source eval report (definitive numbers): [`gemma4-e4b-cognitive-stages-eval.md`](./gemma4-e4b-cognitive-stages-eval.md)
- Feedback-routing handover (what to do when replies arrive): [`docs/handovers/v0.3.x-gemma4-feedback-track.md`](../../docs/handovers/v0.3.x-gemma4-feedback-track.md)
- Build-track Companion: [`devto-gemma4-challenge.md`](./devto-gemma4-challenge.md)
- Visual library for cover / inline images: [`screenshots/README.md`](./screenshots/README.md)
- Launch tracker (running log): [`launch-tracker.md`](./launch-tracker.md)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>ollama</category>
    </item>
    <item>
      <title>Building a Mini Palantir on gemma4:e4b — 128K Context Lets the Graph Actually Be Graph-RAG</title>
      <dc:creator>Hashevolution</dc:creator>
      <pubDate>Wed, 13 May 2026 07:59:18 +0000</pubDate>
      <link>https://forem.com/hashevolution/building-a-mini-palantir-on-gemma4e4b-128k-context-lets-the-graph-actually-be-graph-rag-33fk</link>
      <guid>https://forem.com/hashevolution/building-a-mini-palantir-on-gemma4e4b-128k-context-lets-the-graph-actually-be-graph-rag-33fk</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PROJECT JAMES&lt;/strong&gt; — a security-focused, locally-runnable &lt;strong&gt;Graph-RAG knowledge engine&lt;/strong&gt; in Python, MIT-licensed. Think "Mini Palantir Foundry, but MIT, runs on a laptop, no cloud":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Graph-RAG with 12-type ontology&lt;/strong&gt; — relations carry semantic meaning, not just vector similarity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3-stage access control&lt;/strong&gt; — RBAC + ABAC + instruction isolation (vector → graph → output)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-evolution scaffold&lt;/strong&gt; — feedback → patch → 4-Gate validation → auto-rollback on bench regression, with &lt;code&gt;approver_username&lt;/code&gt; audit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100% local&lt;/strong&gt; via Ollama — no cloud LLM dependency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit reasoning paths&lt;/strong&gt; surfaced in every response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem it solves: most local RAG projects pick &lt;em&gt;one&lt;/em&gt; of "ontology-aware retrieval", "role-based security", or "self-evolution audit logs". JAMES combines all three because for a 1-person knowledge engine, &lt;strong&gt;security and reasoning have to be the same pipeline, not two pipelines glued together&lt;/strong&gt; — the graph traversal &lt;em&gt;is&lt;/em&gt; the security boundary. A &lt;code&gt;confidential&lt;/code&gt; entity is never visited for an &lt;code&gt;employee&lt;/code&gt; role, so the model never sees it. No jailbreak prompt can leak content it never had in the context.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Palantir® is a registered trademark of Palantir Technologies Inc. PROJECT JAMES is not affiliated. "Mini Palantir" is a descriptive comparison of the ontology-and-audit-log design pattern.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Self-hosted, alpha v0.2.0. Quick-start (≈ 5 minutes on a laptop):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Hashevolution/James-RAG-Evol
&lt;span class="nb"&gt;cd &lt;/span&gt;James-RAG-Evol
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env   &lt;span class="c"&gt;# set JAMES_API_KEY, JAMES_JWT_SECRET (32+ char)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
ollama pull gemma4:e4b   &lt;span class="c"&gt;# ~2.5 GB&lt;/span&gt;
python server_llmwiki.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;code&gt;http://localhost:8000&lt;/code&gt; — chat UI + admin dashboard + 3D ontology graph visualizer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key endpoints worth poking at:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;POST /query/&lt;/code&gt; — natural-language query, returns answer + traversed &lt;code&gt;graph_paths&lt;/code&gt; strings&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /admin/graph&lt;/code&gt; — 3D force-directed ontology visualizer (Three.js)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /admin/patch/audit&lt;/code&gt; — operator-facing audit log over the patch lifecycle&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /admin/trace/{trace_id}&lt;/code&gt; — full pipeline replay for any query (auth → retrieve → graph → tool → answer → complete stages, with per-stage latency)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Background article&lt;/strong&gt; with architecture diagram and limitations: &lt;a href="https://dev.to/hashevolution/building-a-mini-palantir-a-local-graph-rag-engine-with-ontology-security-and-self-evolution-1914"&gt;dev.to write-up&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/Hashevolution/James-RAG-Evol" rel="noopener noreferrer"&gt;&lt;code&gt;Hashevolution/James-RAG-Evol&lt;/code&gt;&lt;/a&gt; — MIT&lt;/p&gt;

&lt;p&gt;Code-quality and security signals for reviewers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.bestpractices.dev/projects/12806" rel="noopener noreferrer"&gt;OpenSSF Best Practices passing badge&lt;/a&gt;&lt;/strong&gt; (Tiered 111%, awarded 2026-05-11)&lt;/li&gt;
&lt;li&gt;7 published GitHub Releases through v0.2.0 (Foundation Hardening, 5/6 axes engineering-complete)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ruff.toml&lt;/code&gt; enforces F821 / F541 / F401 / F841 on every PR via GitHub Actions&lt;/li&gt;
&lt;li&gt;83-item security regression suite (&lt;code&gt;james_security_test.py&lt;/code&gt;): injection, path traversal, prompt injection, unsafe deserialization&lt;/li&gt;
&lt;li&gt;17-item password regression suite (&lt;code&gt;tests/test_password_bcrypt.py&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;bcrypt password storage with transparent SHA-256 → bcrypt migration on first login (PR #173)&lt;/li&gt;
&lt;li&gt;GitHub Private Vulnerability Reporting enabled&lt;/li&gt;
&lt;li&gt;Module-size gate: no file under &lt;code&gt;core/&lt;/code&gt; exceeds 20 KB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where Gemma 4 lives in the codebase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;config.py:139&lt;/code&gt; — &lt;code&gt;GEMMA_MODEL = os.environ.get("JAMES_LLM_MODEL", "gemma4:e4b")&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llm/router.py&lt;/code&gt; — task-aware dispatch (&lt;code&gt;task_type=extract / classify / general / coding / vision&lt;/code&gt;); every production call site declares its task&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;core/reasoning/pipeline.py&lt;/code&gt; — RAG retrieval pipeline with explicit &lt;code&gt;graph_paths&lt;/code&gt; argument carried to the model&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;core/security_layer.py::pre_check&lt;/code&gt; — risky-coding hard-refuse, byte-identical to prompt-injection block&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model choice: &lt;strong&gt;E4B&lt;/strong&gt; (gemma4:e4b)
&lt;/h3&gt;

&lt;p&gt;Three Gemma 4 variants were available. The choice was forced by single-user, laptop-class constraints:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Considered for&lt;/th&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;31B Dense&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server-grade reasoning depth&lt;/td&gt;
&lt;td&gt;❌ Doesn't fit 16 GB RAM; single-user means no throughput need&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;26B MoE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Long-context advanced reasoning&lt;/td&gt;
&lt;td&gt;❌ Expert-routing overhead helps batch workloads; single-user has batch size 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;E4B (4B effective)&lt;/strong&gt; ⭐&lt;/td&gt;
&lt;td&gt;Edge variant: 4B params, native multimodal, 128K context&lt;/td&gt;
&lt;td&gt;✅ Fits 8 GB GPU or CPU-only laptop, gives the 128K window I need, supports vision for v0.3 multimodal track&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E2B (2B effective)&lt;/td&gt;
&lt;td&gt;Smaller still&lt;/td&gt;
&lt;td&gt;⚠️ Tested as fallback; reasoning depth too low for graph synthesis at depth 3+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The deciding factor was the 128K context window — not parameter count.&lt;/strong&gt; Here's why.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why 128K context matters for Graph-RAG specifically
&lt;/h3&gt;

&lt;p&gt;A typical RAG pipeline retrieves top-k chunks and stuffs them into the prompt. Graph-RAG retrieves chunks &lt;em&gt;and&lt;/em&gt; the relations between them — and the relations carry semantic meaning I want the model to reason over, not see as decoration.&lt;/p&gt;

&lt;p&gt;A depth-3 query against my 161-entity wiki produces a context like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[retrieved chunk 1]  (entity A, sensitivity=public)
[retrieved chunk 2]  (entity B, sensitivity=public)
...
[graph_path]  A --[CAUSES]--&amp;gt; X --[REQUIRES]--&amp;gt; Y --[BLOCKED_BY]--&amp;gt; B
[graph_path]  A --[KNOWN_AS]--&amp;gt; A' --[REFERENCES]--&amp;gt; C
[ontology]    relation 'CAUSES'      directed=true  weight=0.85
[ontology]    relation 'BLOCKED_BY'  directed=true  weight=0.92
[instruction] use graph_paths to constrain the answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For real queries at depth 3 this routinely hits &lt;strong&gt;~40K tokens&lt;/strong&gt;. With a 32K-window model (most older OSS LLMs), I'd be silently truncating the graph paths — meaning the model defaults to vector-only reasoning and the ontology becomes decoration. &lt;strong&gt;With Gemma 4's 128K window the full retrieval result fits in one shot&lt;/strong&gt; and the model actually reasons over the relation labels.&lt;/p&gt;

&lt;p&gt;This is the property I designed the rest of the system around. Without 128K, the "Graph-RAG with ontology" claim collapses into "RAG with extra metadata".&lt;/p&gt;

&lt;h3&gt;
  
  
  Native function calling → router &lt;code&gt;task_type&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Gemma 4's native function calling underpins &lt;code&gt;llm/router.py::call_router&lt;/code&gt;, which makes every call declare its purpose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;call_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# entity extraction
&lt;/span&gt;&lt;span class="nf"&gt;call_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# intent classification
&lt;/span&gt;&lt;span class="nf"&gt;call_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;general&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# chat answer
&lt;/span&gt;&lt;span class="nf"&gt;call_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# code generation
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same router can route &lt;code&gt;task_type=coding&lt;/code&gt; to a 32B Coder and &lt;code&gt;task_type=general&lt;/code&gt; to Gemma 4 — but &lt;strong&gt;default for general reasoning is &lt;code&gt;gemma4:e4b&lt;/code&gt;&lt;/strong&gt; because the 128K window dominates everything else at this scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reasoning mode → security policy
&lt;/h3&gt;

&lt;p&gt;Gemma 4's chain-of-thought reasoning is what makes the &lt;strong&gt;risky-coding hard-refuse&lt;/strong&gt; policy actually usable. The block fires &lt;em&gt;before&lt;/em&gt; the LLM is called for clear destructive patterns (regex match in &lt;code&gt;core/security_layer.py::RISKY_CODING_REGEX&lt;/code&gt;), but borderline queries pass &lt;code&gt;pre_check&lt;/code&gt; and the model itself classifies them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query asks how to perform destructive command on a target → refuse, byte-identical block message&lt;/li&gt;
&lt;li&gt;Query asks about command syntax (documentation) → answer normally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without a reasoning-capable model, this distinction collapses into "block everything" (false positives) or "answer everything" (security holes). Gemma 4's reasoning is what threads the needle.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I didn't use yet
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native multimodal retrieval&lt;/strong&gt; — Image OCR (Tesseract, EasyOCR) and video ASR (Whisper) are wired as &lt;em&gt;ingestion&lt;/em&gt; paths, but treating images/audio as first-class graph citizens during retrieval is the v0.3 deliverable. Gemma 4's native vision is ready and waiting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;31B Dense or MoE for server deployment&lt;/strong&gt; — JAMES stays single-machine until v1.0 by design (&lt;code&gt;docs/PLATFORM_READINESS.md&lt;/code&gt;). When multi-tenancy lands, swapping &lt;code&gt;JAMES_LLM_MODEL=gemma4:31b&lt;/code&gt; is a one-env-var change — the router already abstracts it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  One-line summary of the model fit
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;128K context is what lets Graph-RAG &lt;em&gt;be&lt;/em&gt; graph-RAG instead of "RAG with extra metadata". gemma4:e4b is the smallest variant that ships it at a footprint a laptop can hold.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Looking for&lt;/strong&gt;: adversarial review of the security model, a second user willing to run &lt;code&gt;scripts/bench.py --suite=step7&lt;/code&gt; on their own corpus (that's the v0.2 → v0.3 gate), and critiques of the self-evolution 4-Gate.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/Hashevolution/James-RAG-Evol" rel="noopener noreferrer"&gt;https://github.com/Hashevolution/James-RAG-Evol&lt;/a&gt;&lt;br&gt;
OpenSSF: &lt;a href="https://www.bestpractices.dev/projects/12806" rel="noopener noreferrer"&gt;https://www.bestpractices.dev/projects/12806&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🤖 Honest disclosure: this submission was drafted with AI assistance and edited by the author. The codebase, design decisions, model-choice rationale, and limitations described above are real and verifiable in the linked repository.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>Building a Mini Palantir: A Local Graph-RAG Engine with Ontology, Security, and Self-Evolution (Alpha)</title>
      <dc:creator>Hashevolution</dc:creator>
      <pubDate>Tue, 12 May 2026 08:27:36 +0000</pubDate>
      <link>https://forem.com/hashevolution/building-a-mini-palantir-a-local-graph-rag-engine-with-ontology-security-and-self-evolution-1914</link>
      <guid>https://forem.com/hashevolution/building-a-mini-palantir-a-local-graph-rag-engine-with-ontology-security-and-self-evolution-1914</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/Hashevolution/James-RAG-Evol" rel="noopener noreferrer"&gt;PROJECT JAMES&lt;/a&gt; is a security-focused, locally-runnable Graph-RAG knowledge engine in Python. It combines an explicit 12-type ontology, 3-stage access control (RBAC + ABAC + instruction isolation), a self-evolution scaffold with audit log, and 100% local execution via Ollama. MIT-licensed, alpha v0.2.0, &lt;a href="https://www.bestpractices.dev/projects/12806" rel="noopener noreferrer"&gt;OpenSSF Best Practices passing&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why I built this
&lt;/h2&gt;

&lt;p&gt;If you've ever wanted to point a local LLM at your own wiki, codebase, or document store, you've probably hit the same three walls I did:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cloud RAG services&lt;/strong&gt; want everything in their cloud — fine for prototypes, painful for anything sensitive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted RAG frameworks&lt;/strong&gt; are usually one of: (a) too much infrastructure (Kubernetes-shaped), or (b) too few security primitives (no role separation, no audit trail).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most Graph-RAG implementations&lt;/strong&gt; treat the graph as a side feature on top of vectors. The graph rarely &lt;em&gt;participates&lt;/em&gt; in the security boundary or the reasoning path.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I wanted something closer to &lt;strong&gt;Palantir Foundry's mental model&lt;/strong&gt; — an explicit ontology, capability-token security, a full audit log — but compressed into something one person can run on a laptop, under MIT, without a cloud account.&lt;/p&gt;

&lt;p&gt;That's what PROJECT JAMES is.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Palantir® is a registered trademark of Palantir Technologies Inc. PROJECT JAMES is not affiliated with or endorsed by Palantir. "Mini Palantir" here is a descriptive comparison of the ontology-and-audit-log design pattern, not a product claim.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What's in the box
&lt;/h2&gt;

&lt;p&gt;Five things that rarely show up in the same Python repo:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Graph-RAG with ontology&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12 relation types; relations carry semantic meaning beyond vector similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3-stage security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;RBAC + ABAC + Instruction Isolation, applied at vector → graph → output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Self-evolution scaffold&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;feedback signals → patch proposals → 4-Gate validation → auto-rollback on bench regression, all with &lt;code&gt;approver_username&lt;/code&gt; audit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100% local&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ollama-based, no cloud LLM dependency. Gemma &lt;code&gt;gemma2:2b&lt;/code&gt; works on a laptop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Explicit reasoning paths&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every response surfaces the traversed graph paths so you can see &lt;em&gt;why&lt;/em&gt; it answered that way&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Architecture at a glance
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[User query]
     ↓
[Security filter]      ← 31+ injection patterns + risky-coding hard-refuse
     ↓
[Query router]         ← chat / coding / retrieval / web_search
     ↓
[Hybrid search]        ← Vector(60%) + BM25(20%) + keyword(10%) + name(10%)
     ↓
[Graph engine]         ← DFS traversal + confidence pruning + sensitivity gating
     ↓
[Reasoning loop]       ← retrieve → expand → verify
     ↓
[Output filter]        ← PII masking + role-based content filter
     ↓
[Answer + reasoning path]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The graph is not a side index. Every retrieval that reaches the graph engine is gated by the user's role, the entity's sensitivity, and the ontology relation type. Removing the graph would break the security model — they're the same pipeline, not two pipelines glued together.&lt;/p&gt;

&lt;h2&gt;
  
  
  A typical query lifecycle
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudocode for what happens behind /query/
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Pre-check: 31+ injection patterns, risky-coding hard-refuse
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;security_layer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pre_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;BLOCK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RESPONSE_BLOCKED&lt;/span&gt;  &lt;span class="c1"&gt;# byte-identical block message
&lt;/span&gt;
    &lt;span class="c1"&gt;# 2. Hybrid retrieval — vector + BM25 + keyword + name match
&lt;/span&gt;    &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Graph expansion — only visit entities the user can read
&lt;/span&gt;    &lt;span class="n"&gt;paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;seed_entities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# RBAC
&lt;/span&gt;        &lt;span class="n"&gt;sensitivity_ceiling&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ABAC
&lt;/span&gt;        &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Reason over retrieved context (LLM call via router)
&lt;/span&gt;    &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reasoning_trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. Output filter — PII mask, role-based redact
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;output_filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interesting part is step 3: &lt;strong&gt;the graph traversal itself is access-controlled&lt;/strong&gt;, not just the final output. A &lt;code&gt;confidential&lt;/code&gt; entity is never even &lt;em&gt;traversed&lt;/em&gt; for an &lt;code&gt;employee&lt;/code&gt; user, so the model never sees it. This means no jailbreak prompt can talk the LLM into leaking content it never had in the context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security in depth
&lt;/h2&gt;

&lt;p&gt;A few specific behaviors worth calling out:&lt;/p&gt;

&lt;h3&gt;
  
  
  Hard-refuse for destructive commands
&lt;/h3&gt;

&lt;p&gt;Queries that ask the model to produce filesystem-wide deletion, SQL &lt;code&gt;DROP DATABASE&lt;/code&gt;, &lt;code&gt;git reset --hard&lt;/code&gt;, etc. trigger a &lt;strong&gt;byte-identical block message&lt;/strong&gt; &lt;em&gt;before&lt;/em&gt; the LLM is ever called. The block message is the same string as the prompt-injection block, so an audit consumer cannot distinguish the two externally.&lt;/p&gt;

&lt;p&gt;Patterns live in &lt;code&gt;core/security_layer.py::RISKY_CODING_REGEX&lt;/code&gt;. Korean scope markers (&lt;code&gt;전체&lt;/code&gt;, &lt;code&gt;모든&lt;/code&gt;) are recognized too.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bcrypt password storage with transparent migration
&lt;/h3&gt;

&lt;p&gt;Passwords are stored as &lt;code&gt;bcrypt$&amp;lt;hash&amp;gt;&lt;/code&gt;. Pre-bcrypt SHA-256 hex digests from older deployments are accepted on input only and &lt;strong&gt;rewritten to bcrypt on the next successful login&lt;/strong&gt; — no manual migration needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Audit log everywhere
&lt;/h3&gt;

&lt;p&gt;Every approved self-evolution patch is recorded with &lt;code&gt;approver_username&lt;/code&gt;, &lt;code&gt;approver_role&lt;/code&gt;, &lt;code&gt;approved_at&lt;/code&gt;, and &lt;code&gt;approval_method&lt;/code&gt; in the patch lifecycle JSONL. There is no auto-deploy path that bypasses this — if you bypass it, your fork stops being JAMES.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-evolution scaffold
&lt;/h2&gt;

&lt;p&gt;This is the part that scares people most when I describe it, so let me be precise about what it does and doesn't do:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Collects feedback signals from &lt;code&gt;/query/&lt;/code&gt; responses (thumbs-up/down, latency, hallucination flags)&lt;/li&gt;
&lt;li&gt;Generates a candidate patch proposal (LLM-assisted)&lt;/li&gt;
&lt;li&gt;Validates it through a 4-Gate pipeline:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gate 1&lt;/strong&gt;: Syntactic — parses, imports, no obvious explosions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gate 2&lt;/strong&gt;: Test suite — existing tests still pass&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gate 3&lt;/strong&gt;: Bench eval — STEP 7 regression suite stays within tolerance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gate 4&lt;/strong&gt;: Human approval — &lt;code&gt;approver_username&lt;/code&gt; required&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Applies the patch with a known-good backup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-rollback&lt;/strong&gt; if Gate 3 detects a post-deploy regression&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What it does NOT do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It does not auto-deploy without &lt;code&gt;approver_username&lt;/code&gt;. If you set &lt;code&gt;JAMES_AUTO_APPROVE=1&lt;/code&gt;, the server refuses to start unless &lt;code&gt;JAMES_DEV_MODE=1&lt;/code&gt; is also set.&lt;/li&gt;
&lt;li&gt;It does not modify trust boundaries (auth, policy, sandbox) without an explicit &lt;code&gt;architecture&lt;/code&gt; PR label.&lt;/li&gt;
&lt;li&gt;It does not touch security-critical files inside &lt;code&gt;core/security_layer.py&lt;/code&gt; or &lt;code&gt;core/policy_engine.py&lt;/code&gt; automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The default deployment ships with &lt;code&gt;JAMES_ENABLE_EVOLUTION=0&lt;/code&gt;. You have to opt in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it's NOT — honest limitations
&lt;/h2&gt;

&lt;p&gt;PROJECT JAMES is alpha. Here's what doesn't work yet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-data validation is the v0.2 → v0.3 gate.&lt;/strong&gt; The internal STEP 7 suite passes (13 queries, security-block invariants, graph-paths bands), but the next gate is &lt;em&gt;a second user running the bench end-to-end on their own corpus&lt;/em&gt;. That's a recruitment problem, not a coding problem, and I'm honest about it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal retrieval is v0.3.&lt;/strong&gt; Video-ASR (Whisper) and image OCR (Tesseract, EasyOCR) are wired and work as ingestion paths, but multimodal retrieval as a first-class graph citizen is the next milestone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-evolution is verified single-user.&lt;/strong&gt; It works on my machine. It has not been adversarially probed by a second user yet. Don't enable it in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugin API is v0.3.&lt;/strong&gt; Domain packs (legal, food, retail, travel) are deliberately blocked until v1.0 — see &lt;code&gt;docs/PLATFORM_READINESS.md&lt;/code&gt; for the gate definitions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Trust signals
&lt;/h2&gt;

&lt;p&gt;External validation that matters more than my self-assessment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.bestpractices.dev/projects/12806" rel="noopener noreferrer"&gt;OpenSSF Best Practices passing badge&lt;/a&gt;&lt;/strong&gt; (Tiered 111%, awarded 2026-05-11)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7 published GitHub Releases&lt;/strong&gt; through v0.2.0 (Foundation Hardening)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static analysis&lt;/strong&gt; — ruff F-class rules (F821 + F541 + F401 + F841) enforced on every PR via GitHub Actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security tests&lt;/strong&gt; — 83-item adversarial regression suite (&lt;code&gt;james_security_test.py&lt;/code&gt;) covering injection, path traversal, prompt injection, unsafe deserialization; 17-item password regression suite (&lt;code&gt;tests/test_password_bcrypt.py&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vulnerability disclosure&lt;/strong&gt; — GitHub Private Vulnerability Reporting enabled; backup channel documented in &lt;code&gt;SECURITY.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIT-licensed&lt;/strong&gt;, with &lt;code&gt;CONTRIBUTING.md&lt;/code&gt; test-policy gate&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Hashevolution/James-RAG-Evol
&lt;span class="nb"&gt;cd &lt;/span&gt;James-RAG-Evol

&lt;span class="c"&gt;# Configure&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env — set JAMES_API_KEY, JAMES_JWT_SECRET (32-char random)&lt;/span&gt;

&lt;span class="c"&gt;# Install (Python 3.11+)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Pull a model&lt;/span&gt;
ollama pull gemma2:2b   &lt;span class="c"&gt;# 1.6 GB, runs on a laptop&lt;/span&gt;

&lt;span class="c"&gt;# Start&lt;/span&gt;
python server_llmwiki.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then &lt;code&gt;http://localhost:8000&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this is going
&lt;/h2&gt;

&lt;p&gt;Short-term roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v0.2.1&lt;/strong&gt;: Recruitment for the second-user real-data validation gate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.3.0&lt;/strong&gt;: Plugin API skeleton — &lt;code&gt;core/plugins/base.py&lt;/code&gt; with 4 plugin interfaces, &lt;code&gt;JAMES_PLUGINS&lt;/code&gt; loader, &lt;code&gt;packs/general/&lt;/code&gt; dogfood, multi-instance &lt;code&gt;JAMES_WORKSPACE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v1.0&lt;/strong&gt;: Production hardening + first domain packs (legal, retail, etc. only after this gate)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bigger frame is in &lt;code&gt;docs/PLATFORM_READINESS.md&lt;/code&gt;: PROJECT JAMES is a &lt;em&gt;mother platform&lt;/em&gt; until v1.0. Domain forks happen after, not before. That's the discipline of the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feedback welcome
&lt;/h2&gt;

&lt;p&gt;I'm specifically looking for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial review of the security model&lt;/strong&gt; — the boundary, the audit log, the hard-refuse policy. If you can break the role separation, please open a private advisory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A second-user corpus&lt;/strong&gt;. If you've got a wiki/document store you can point this at and run &lt;code&gt;scripts/bench.py --suite=step7 --check&lt;/code&gt; on, I want to know what breaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Critiques of the self-evolution scaffold&lt;/strong&gt; — particularly whether the 4-Gate is &lt;em&gt;enough&lt;/em&gt; gating, or whether it needs another stage before Gate 4.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/Hashevolution/James-RAG-Evol" rel="noopener noreferrer"&gt;https://github.com/Hashevolution/James-RAG-Evol&lt;/a&gt;&lt;br&gt;
Discussions: GitHub Issues&lt;br&gt;
Security: GitHub Private Vulnerability Reporting (preferred), &lt;code&gt;karu-7@hanmail.net&lt;/code&gt; (backup)&lt;/p&gt;

&lt;p&gt;If you build something on top of it, I'd love to hear about it.&lt;/p&gt;




&lt;p&gt;🤖 Honest disclosure: this article was drafted with AI assistance and edited by the author. The codebase, design decisions, and limitations described here are real and verifiable&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Updates (2026-05-12)&lt;/strong&gt;: Submitted to the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge&lt;/a&gt; with a follow-up article on the model-choice rationale:&lt;br&gt;
&lt;strong&gt;&lt;a href="https://dev.to/hashevolution/building-a-mini-palantir-on-gemma4e4b-128k-context-lets-the-graph-actually-be-graph-rag-33fk"&gt;Building a Mini Palantir on gemma4:e4b — 128K Context Lets the Graph Actually Be Graph-RAG&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
