<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Chris Kilner</title>
    <description>The latest articles on Forem by Chris Kilner (@chris-rhiza-fr).</description>
    <link>https://forem.com/chris-rhiza-fr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852885%2Fec56afcf-065f-4a30-8ec1-edf05b712d1d.png</url>
      <title>Forem: Chris Kilner</title>
      <link>https://forem.com/chris-rhiza-fr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/chris-rhiza-fr"/>
    <language>en</language>
    <item>
      <title>What did gemma see? - Thinking in comments...</title>
      <dc:creator>Chris Kilner</dc:creator>
      <pubDate>Wed, 20 May 2026 15:19:29 +0000</pubDate>
      <link>https://forem.com/chris-rhiza-fr/what-did-gemma-see-thinking-in-comments-3lhm</link>
      <guid>https://forem.com/chris-rhiza-fr/what-did-gemma-see-thinking-in-comments-3lhm</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Write About Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;While running a simple harness around the HumanEval benchmark problems as a &lt;a href="https://ai.rhiza.fr/humaneval/" rel="noopener noreferrer"&gt;test of local models&lt;/a&gt;, I was surprised to see gemma4:26b become the &lt;strong&gt;first local model to pass the controversial HumanEval/145 question.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not only had &lt;strong&gt;gemma4:26b solved it&lt;/strong&gt;, it was also the &lt;strong&gt;only model to score 164/164&lt;/strong&gt;, a perfect run.&lt;/p&gt;

&lt;p&gt;I hadn't seen a single pass on HumanEval/145 in any of the ~50 runs with other models from the Gemma, Qwen, Deepseek, Mistral, Granite, LLaMA, OLMo, Nemotron,... families. Why?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HumanEval Leaderboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5k0k5lmuxcgv7b37lm8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5k0k5lmuxcgv7b37lm8.png" alt="HumanEval Leaderboard" width="800" height="870"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is HumanEval/145?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://ai.rhiza.fr/humaneval/task_HumanEval-145.html" rel="noopener noreferrer"&gt;HumanEval/145&lt;/a&gt; is a simple sorting problem described acceptably for a human, but carelessly worded enough to prevent models from finding the answer.&lt;/p&gt;

&lt;p&gt;It is a toy example. But the failure mode - a model latching onto a wrong definition and defending it against your intentions - is also seen when prompting larger models, especially with more complex requests. It is worth examining closely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;order_by_points&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Write a function which sorts the given list of integers
    in ascending order according to the sum of their digits.
    Note: if there are several items with similar sum of their digits,
    order them based on their index in original list.

    For example:
&lt;/span&gt;&lt;span class="gp"&gt;    &amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;order_by_points&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;order_by_points&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are looking for a solution similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;order_by_points&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sum_n&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;digits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
        &lt;span class="c1"&gt;# -11 -&amp;gt; [-1, 1]
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nums&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sum_n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each problem has hidden tests to validate the solution. Failing these is a Pass@1 failure. Harnesses can send these failures back to the model.&lt;/p&gt;

&lt;p&gt;This is the &lt;a href="https://ai.rhiza.fr/humaneval/setup.html" rel="noopener noreferrer"&gt;harness used in my HumanEval runs&lt;/a&gt;. It is a rigid harness that tests generation, not tool-calling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh92rivlrytoq83ra0eje.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh92rivlrytoq83ra0eje.png" alt="Harness flow diagram" width="742" height="537"&gt;&lt;/a&gt;&lt;/p&gt;
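
&lt;p&gt;The feedback loop described above can be sketched roughly as follows. This is not the author's harness: &lt;code&gt;run_with_feedback&lt;/code&gt;, &lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;check&lt;/code&gt; are hypothetical names, the toy stand-ins replace a real model call, and the cap of five iterations is only borrowed from the iteration counts reported later in this post.&lt;/p&gt;

```python
def run_with_feedback(prompt, generate, check, max_iterations=5):
    """Return the 1-based iteration that passed, or None if none did."""
    feedback = None
    for iteration in range(1, max_iterations + 1):
        code = generate(prompt, feedback)  # hypothetical model call
        feedback = check(code)             # None means the hidden tests passed
        if feedback is None:
            return iteration
    return None


# Toy stand-ins so the sketch runs without a model.
attempts = iter(["def f(x): return x", "def f(x): return x + 1"])


def fake_generate(prompt, feedback):
    # A real harness would send the prompt plus the last failure to the model.
    return next(attempts)


def fake_check(code):
    # Run the candidate and report a failure message, or None on success.
    scope = {}
    exec(code, scope)
    return None if scope["f"](1) == 2 else "f(1) should be 2"


print(run_with_feedback("add one", fake_generate, fake_check))  # prints 2
```

&lt;p&gt;A return value of 1 corresponds to a Pass@1 result; anything higher means the model needed the failure message to get there.&lt;/p&gt;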

&lt;h2&gt;
  
  
  Separated by a common language
&lt;/h2&gt;

&lt;p&gt;A human reads the example above and thinks: &lt;strong&gt;small signed numbers&lt;/strong&gt;. An LLM sees: &lt;strong&gt;digits&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Nearly the same but catastrophically different. For humans, numbers are empirically grounded in size and direction. They exist on the number line. I owe you one beer.&lt;/p&gt;

&lt;p&gt;For an LLM, this provokes ambiguity. "Sum of digits" is well-defined for positive integers. For negative ones, it isn't - and "digit" can mean several different things. Models fill in the gap with one of three implicit rules:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 1 - Mathematical (the default, dominant, fail)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-11&lt;/code&gt; is &lt;code&gt;(-1) × 11&lt;/code&gt;. Digits come from the magnitude: &lt;code&gt;abs(-11)&lt;/code&gt; → &lt;code&gt;[1, 1]&lt;/code&gt;. Sum = &lt;strong&gt;2&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the idiomatic Python way to extract digits. Every tutorial does it this way. The sign is a property of the value, not a digit. It is mathematically and syntactically correct for positive integers. The model doesn't consider the sign part of "digits" at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 2 - String representation (rare, fail)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-11&lt;/code&gt; is the string &lt;code&gt;"-11"&lt;/code&gt;. Characters &lt;code&gt;['-', '1', '1']&lt;/code&gt;. Treat &lt;code&gt;-&lt;/code&gt; as -1. Sum = &lt;strong&gt;1&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the move deepseek_instant makes. It's the simplest way to make the sign count. Still wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 3 - Canonical hybrid (only gemma4:26b, correct)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-11&lt;/code&gt;: take digits of &lt;code&gt;abs(-11)&lt;/code&gt; → &lt;code&gt;[1, 1]&lt;/code&gt;, then apply the sign to the first digit only → &lt;code&gt;[-1, 1]&lt;/code&gt;. Sum = &lt;strong&gt;0&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;digits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No standard Python idiom teaches this. It's a hybrid: magnitude-based digit extraction from Rule 1, sign-on-first-digit from Rule 2. The model must invent it from the single provided example.&lt;/p&gt;

&lt;p&gt;The problem statement structurally favours Rule 1. "Sum of their digits" is a concept from recreational mathematics, defined for positive integers. Nothing in the prose hints at what to do with a sign. The example is the only signal that Rule 1 is wrong - and reading that signal requires not just noticing the contradiction, but being willing to ask: &lt;em&gt;what definition of digit sum would make this output correct?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Noticing is not enough. e2b and e4b both notice. Neither asks the question.&lt;/p&gt;
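
&lt;p&gt;To make the difference concrete, here is a minimal sketch that runs all three rules against the docstring example. The function names &lt;code&gt;rule1&lt;/code&gt;, &lt;code&gt;rule2&lt;/code&gt; and &lt;code&gt;rule3&lt;/code&gt; are mine, chosen only to mirror the rule numbers above.&lt;/p&gt;

```python
def rule1(n):
    # Rule 1: digits of the magnitude only; the sign is ignored.
    return sum(int(d) for d in str(abs(n)))


def rule2(n):
    # Rule 2: walk the string form and count the '-' character as -1.
    return sum(-1 if c == "-" else int(c) for c in str(n))


def rule3(n):
    # Rule 3: digits of the magnitude, sign applied to the first digit only.
    digits = [int(d) for d in str(abs(n))]
    sign = -1 if n != abs(n) else 1
    return sign * digits[0] + sum(digits[1:])


nums = [1, 11, -1, -11, -12]
expected = [-1, -11, 1, -12, 11]

for rule in (rule1, rule2, rule3):
    # sorted() is stable, so equal keys keep their original index order.
    result = sorted(nums, key=rule)
    print(rule.__name__, result, result == expected)
# rule1 [1, -1, 11, -11, -12] False
# rule2 [-1, 1, -11, 11, -12] False
# rule3 [-1, -11, 1, -12, 11] True
```

&lt;p&gt;Only Rule 3 reproduces the expected output, and it does so with a plain stable sort. No extra tie-break rule is needed.&lt;/p&gt;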

&lt;h2&gt;
  
  
  How do the other gemma4 models fail?
&lt;/h2&gt;

&lt;p&gt;e2b and e4b write the same kind of multi-hypothesis comment blocks. They reach the same contradiction. What they don't do is ask what gemma4:26b asks next.&lt;/p&gt;

&lt;p&gt;They nearly succeed. They see the problem, then give up, sometimes blaming the prompt.&lt;/p&gt;

&lt;p&gt;gemma4:e2b and gemma4:e4b both produce similar lines of comment-reasoning. They both correctly identify that &lt;code&gt;abs()&lt;/code&gt; digit sums don't match the expected output. They both try every tiebreaker combination imaginable. They show their work.&lt;/p&gt;

&lt;p&gt;But they never ask the question gemma4:26b asks. Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ai.rhiza.fr/humaneval/results_human-eval-enhanced-202307_gemma4_e2b_nothink.html" rel="noopener noreferrer"&gt;gemma4:e2b&lt;/a&gt;&lt;/strong&gt; concludes: &lt;em&gt;"the test case is flawed relative to its own description"&lt;/em&gt; - and reverts to its wrong implementation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ai.rhiza.fr/humaneval/results_human-eval-enhanced-202307_gemma4_e4b_nothink.html" rel="noopener noreferrer"&gt;gemma4:e4b&lt;/a&gt;&lt;/strong&gt; goes further: by iteration 4 it abandons &lt;code&gt;digit_sum&lt;/code&gt; entirely and concludes the sort key must be the raw number value - a plain numeric sort, even further from correct.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ai.rhiza.fr/humaneval/results_human-eval-enhanced-202307_gemma4_31b_nothink.html" rel="noopener noreferrer"&gt;gemma4:31b&lt;/a&gt;&lt;/strong&gt; fails the same way, despite being the larger model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question is when, exactly, the decision is made - and whether anything could interrupt it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why they fail
&lt;/h2&gt;

&lt;p&gt;Using llama.cpp's token probability API, each model was started at the blank first line of &lt;code&gt;digit_sum&lt;/code&gt;'s body and the greedy token path was recorded step by step - no chat history, no preceding deliberation, just the original problem statement.&lt;/p&gt;

&lt;p&gt;Both models write a comment before any code. The comment that follows is nearly identical in both cases and written with near-certainty:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Handle negative numbers by taking the absolute...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;if&lt;/code&gt; was available at the first token: P=0.0002 for e2b, P=0.0037 for e4b - unreachable. Forcing it in anyway changes nothing - both models write &lt;code&gt;if n &amp;lt; 0: n = abs(n)&lt;/code&gt; and continue as before. The conditional is available. Using it for anything other than &lt;code&gt;abs()&lt;/code&gt; is not.&lt;/p&gt;

&lt;p&gt;This just confirms that the prior for &lt;code&gt;digit_sum&lt;/code&gt; leading to &lt;code&gt;abs()&lt;/code&gt; is very strong. A deeper analysis of the branching within thinking traces would be required to take this to a conclusion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thinking in comments - How does gemma4:26b solve it?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://ai.rhiza.fr/humaneval/results_human-eval-enhanced-202307_gemma4_26b_think.html#HumanEval/145" rel="noopener noreferrer"&gt;gemma4:26b think=true&lt;/a&gt;, Pass@1, 178.7 seconds
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kuio1s5xj46dktob4eb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kuio1s5xj46dktob4eb.png" alt="gemma4:26b think performance" width="632" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In thinking mode, it does what we would expect of a good thinking model. It uses 782 lines of thinking output as an internal dialogue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;thinking...
...
Is there a different way to interpret "sum of their digits"?
Maybe the digit sum includes the negative sign? No, that's not common.
...
This is extremely confusing. Let's look at the sums again.
Is there ANY way to get `[-1, -11, 1, -12, 11]`?
...
Wait! I just noticed something.
If we sort the *original* numbers by their digit sums, but treat negative signs as part of the digits?
...
Wait, no, that's not it.
The first digit is negative if the number is negative. All other digits are positive.
`sum_digits(n) = (int(str(abs(n))[0]) * (-1 if n &amp;lt; 0 else 1)) + sum(int(d) for d in str(abs(n))[1:])`
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;a href="https://ai.rhiza.fr/humaneval/results_human-eval-enhanced-202307_gemma4_26b_nothink.html#HumanEval/145" rel="noopener noreferrer"&gt;gemma4:26b think=false&lt;/a&gt;, Pass@3, 49.4 seconds
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faacp82885afajytgcpsz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faacp82885afajytgcpsz.png" alt="gemma4:26b performance" width="635" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration 1 - abs_pure, no commentary:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Writes the standard wrong solution using &lt;code&gt;sum(int(d) for d in str(abs(n)))&lt;/code&gt;. No reasoning visible. Test fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration 2 - tries all-negative:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Adds a comment at the top: &lt;em&gt;"treating the digits as part of the value (e.g., -11 is -1 + -1 = -2)"&lt;/em&gt;. Implements &lt;code&gt;sign * sum_of_abs_digits&lt;/code&gt;. Test fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration 3 - thinking in comments&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The body of &lt;code&gt;get_digit_sum&lt;/code&gt; is mostly comments (153 lines). The model works through the problem using exactly the same hypothesis-elimination strategy as the think trace, but &lt;strong&gt;outputs thinking inside the output code as comments&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ...
# Let's try: sum of digits, but for negative numbers, the sum is (digit1 * -1) + digit2 + digit3...
# -1: -1
# -11: -1 + 1 = 0
# ...
# Sums: -1, 0, 1, 1, 2.
# Sorted: -1, 0, 1, 1, 2.
# Values: -1, -11, 1, -12, 11.
# Let's check:
# ...
# Wait, the indices for 1 and -12 are 0 and 3. So 1 comes before -12.
# The order would be: -1, -11, 1, -12, 1
# YES! This matches the example exactly!
# ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Across 40+ models in think=false mode, &lt;strong&gt;thinking in comments is unique to the gemma4 family&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is barely legal: syntactically valid, but stylistically questionable. It dirties the final result, which would need a cleanup pass to leave a succinct, meaningful comment. It is still ~3 times faster than think=true.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comment lines per response across the full benchmark&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Family&lt;/th&gt;
&lt;th&gt;avg&lt;/th&gt;
&lt;th&gt;max&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;gemma4 (think=false)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;20.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;160&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma4 (think=true)&lt;/td&gt;
&lt;td&gt;6.5&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;glm&lt;/td&gt;
&lt;td&gt;4.4&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;devstral&lt;/td&gt;
&lt;td&gt;3.3&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral&lt;/td&gt;
&lt;td&gt;2.6&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ministral&lt;/td&gt;
&lt;td&gt;2.0&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5&lt;/td&gt;
&lt;td&gt;1.9&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3&lt;/td&gt;
&lt;td&gt;1.8&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;granite&lt;/td&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama&lt;/td&gt;
&lt;td&gt;1.4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;deepseek&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma3&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nemotron&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;olmo&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In think=false mode, gemma4 averages 20.9 comment lines per response with a maximum of 160. The next family (glm) averages 4.4 with a max of 7. Most families barely touch 1-2 lines regardless of problem difficulty. olmo writes no comments at all.&lt;/p&gt;

&lt;p&gt;Switch the same gemma4 models to think=true and the average drops from 20.9 to 6.5, the max from 160 to 12. The comments don't just shrink - they revert to the kind of brief, descriptive commentary every other family writes. When the internal reasoning channel is available, the model uses it. When it isn't, it carves one out of the code.&lt;/p&gt;
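
&lt;p&gt;For anyone who wants to reproduce this kind of count, here is one possible way to measure comment lines in a generated solution. It is a sketch using Python's standard &lt;code&gt;tokenize&lt;/code&gt; module. The helper name &lt;code&gt;count_comment_lines&lt;/code&gt; and the sample snippet are illustrative assumptions, and the figures in the table above were not necessarily produced this way.&lt;/p&gt;

```python
import io
import tokenize


def count_comment_lines(source):
    """Count the lines of Python source that carry a comment token."""
    lines = set()
    # Assumes the source tokenizes cleanly; broken model output may raise.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            lines.add(tok.start[0])  # start is a (row, column) pair
    return len(lines)


sample = '''def get_digit_sum(n):
    # Try: the first digit takes the sign
    # -11: -1 + 1 = 0
    digits = [int(d) for d in str(abs(n))]  # magnitude digits
    return digits[0]
'''

print(count_comment_lines(sample))  # prints 3
```

&lt;p&gt;Going through the tokenizer rather than searching for &lt;code&gt;#&lt;/code&gt; means a hash character inside a string literal is not miscounted as a comment.&lt;/p&gt;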




&lt;h2&gt;
  
  
  Can we fix this?
&lt;/h2&gt;

&lt;p&gt;The commitment to Rule 1 is made before any code is written. Four approaches are worth testing against it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does iteration help? (feeding the error back in)
&lt;/h3&gt;

&lt;p&gt;On most problems, yes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ai.rhiza.fr/humaneval/results_human-eval-enhanced-202307_gemma4_e2b_nothink.html" rel="noopener noreferrer"&gt;gemma4:e2b&lt;/a&gt; goes from 77% on the first attempt to 93% by iteration 5.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ai.rhiza.fr/humaneval/results_human-eval-enhanced-202307_gemma4_e4b_nothink.html" rel="noopener noreferrer"&gt;gemma4:e4b&lt;/a&gt; goes from 88% on the first attempt to 97% by iteration 5.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The feedback loop catches a lot of careless mistakes.&lt;/p&gt;

&lt;p&gt;On problem 145 specifically: No. On this problem all other models fail. Once they've committed to &lt;code&gt;abs()&lt;/code&gt; and blamed the test, they're stuck. Only gemma4:26b asks the right question.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does thinking help?
&lt;/h3&gt;

&lt;p&gt;On some problems, yes. If you don't care about time.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;1-shot (think=false → think=true)&lt;/th&gt;
&lt;th&gt;Median time (think=false → think=true)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gemma4:26b&lt;/td&gt;
&lt;td&gt;97.6% → 100%&lt;/td&gt;
&lt;td&gt;2.5s → 29.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma4:e4b&lt;/td&gt;
&lt;td&gt;88.4% → 89.6%&lt;/td&gt;
&lt;td&gt;4.3s → 5.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemma4:e2b&lt;/td&gt;
&lt;td&gt;76.8% → 90.9%&lt;/td&gt;
&lt;td&gt;2.2s → 12.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It does help but takes much longer. If you have enough VRAM, a larger model with think=false will usually do better than a small model with think=true.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does an explicit example checking prompt help?
&lt;/h3&gt;

&lt;p&gt;On analogous problems with the same structure (misleading prose, example that contradicts it): yes. Three prompting strategies - backwards arithmetic, manual trace, and Socratic questioning - all succeed at rescuing e2b and e4b.&lt;/p&gt;

&lt;p&gt;On problem 145 itself: all experiments failed. The models correctly identify the contradiction and still never ask the right question about their definition. The prior survives targeted pressure.&lt;/p&gt;

&lt;p&gt;All three approaches above try to change the model's answer during the task. The alternative: change the spec before the task starts. If the ambiguity in the problem statement is the root cause, a correctly-specified prompt should let smaller models solve it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does a carefully re-written initial prompt help?
&lt;/h3&gt;

&lt;p&gt;Yes - if the rewrite gets the rule right. Not so easy.&lt;/p&gt;

&lt;p&gt;I asked capable models to rewrite the problem without spoiling the answer. The rewrite needs to describe Rule 3 clearly enough that a small model can implement it, but without just stating the solution. Many failed at this too.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Success?&lt;/th&gt;
&lt;th&gt;Laughable mistake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;gemini&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;None - correct rule, correct examples, clean spec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;gemma4:26b&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;None - canonical rule stated plainly, correct examples, no spoiler language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;opus4.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;None (explicit spoiler: "digit sum of -11 is 0, &lt;strong&gt;not -2&lt;/strong&gt;")&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;chatgpt.com&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;"Ignore the negative sign for negative numbers" = abs_pure. Includes contradictory example without noticing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;sonnet4.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Contradicts itself. Spec is self-refuting.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;haiku4.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;"⚠️ Note to implementing model: The expected output is ground truth. &lt;strong&gt;Do not override it with independent reasoning.&lt;/strong&gt;"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;grok_fast&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;abs_pure + &lt;em&gt;descending&lt;/em&gt; index tie-break (later index wins). Two wrong rules that still can't produce the correct output.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;perplexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;abs_pure + numeric tie-break (ascending value). Also had a syntax error.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;deepseek_instant&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Hyphen-as-minus-one (&lt;code&gt;"-"&lt;/code&gt; → -1 contribution) + &lt;em&gt;descending&lt;/em&gt; index tie-break. Two wrong rules that occasionally accidentally combine to produce the right output.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;deepseek_expert&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Invents a formula: &lt;code&gt;digit_sum(n) = sum_of_digits(abs(n)) - len(str(abs(n)))&lt;/code&gt;. Coincidentally gives the right answer for -11 (2-2=0) and -12 (3-2=1), but wrong in other cases.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The good news is: the re-written prompt with clarified intent can now be solved by small models like gemma4:e2b, gemma4:e4b, even qwen3:4b. Models smaller than that still ignore the instructions and revert to a failing &lt;code&gt;abs&lt;/code&gt; solution.&lt;/p&gt;
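
&lt;p&gt;The deepseek_expert row in the table is a good illustration of how a wrong rule can survive the single example. The sketch below compares its formula with Rule 3 on a few negative inputs. The names &lt;code&gt;rule3&lt;/code&gt; and &lt;code&gt;length_formula&lt;/code&gt; are mine, and the inputs &lt;code&gt;-1&lt;/code&gt; and &lt;code&gt;-123&lt;/code&gt; are extra cases I picked, not ones taken from the benchmark's hidden tests.&lt;/p&gt;

```python
def rule3(n):
    # Rule 3: digits of the magnitude, sign applied to the first digit only.
    digits = [int(d) for d in str(abs(n))]
    sign = -1 if n != abs(n) else 1
    return sign * digits[0] + sum(digits[1:])


def length_formula(n):
    # Formula from the table: digit sum of the magnitude minus its digit count.
    magnitude = str(abs(n))
    return sum(int(d) for d in magnitude) - len(magnitude)


for n in (-11, -12, -1, -123):
    print(n, rule3(n), length_formula(n))
# -11 0 0
# -12 1 1
# -1 -1 0
# -123 4 3
```

&lt;p&gt;For a negative number, Rule 3 subtracts twice the leading digit from the plain digit sum, while the invented formula subtracts the digit count. The two agree only when the digit count happens to be twice the leading digit, which is exactly the case for &lt;code&gt;-11&lt;/code&gt; and &lt;code&gt;-12&lt;/code&gt;.&lt;/p&gt;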

&lt;h2&gt;
  
  
  What did it cost gemma4:26b to do this rewrite?
&lt;/h2&gt;

&lt;p&gt;Poor gemma. This rewrite cost it &lt;strong&gt;14 minutes and 1,144 lines of agony&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;20 '&lt;strong&gt;Wait!&lt;/strong&gt; Let me re-read the example output &lt;strong&gt;VERY carefully&lt;/strong&gt;.' or similar&lt;/li&gt;
&lt;li&gt;At least 10 distinct computational rules, each fully worked through.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Despair and self-doubt:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"I'm literally staring at this and &lt;strong&gt;it's not making sense.&lt;/strong&gt;"&lt;/li&gt;
&lt;li&gt;"Let me look at the example output one more time. &lt;strong&gt;I must be insane.&lt;/strong&gt;"&lt;/li&gt;
&lt;li&gt;"Okay, I'm going to &lt;strong&gt;stop trying to find the logic.&lt;/strong&gt;" - immediately followed by finding it again&lt;/li&gt;
&lt;li&gt;"Is it possible the example output is &lt;code&gt;[-1, 1, -11, 11, -12]&lt;/code&gt; and &lt;strong&gt;I'm just seeing things?&lt;/strong&gt; No..." - it asks whether it's hallucinating the numbers&lt;/li&gt;
&lt;li&gt;"Is it possible &lt;strong&gt;I am misreading the numbers?&lt;/strong&gt;"&lt;/li&gt;
&lt;li&gt;"Let me re-read the provided text one more time. VERY carefully."&lt;/li&gt;
&lt;li&gt;"&lt;strong&gt;Could the example be wrong?&lt;/strong&gt;" - this is the exact dead-end where e4b stops. The 26b raises it, then keeps going.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fake web search:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"Actually, let me search for this specific function &lt;code&gt;order_by_points&lt;/code&gt; and the example &lt;code&gt;[1, 11, -1, -11, -12]&lt;/code&gt;. &lt;strong&gt;Searching... I found a similar problem on a site.&lt;/strong&gt;"&lt;/em&gt; - complete lie.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no web search. The model invents an external authority mid-reasoning to give itself permission to try the non-standard rule it's about to test. It invents a citation for the correct answer before it has actually derived it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three times&lt;/strong&gt; it discovers the correct rule and does not stop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"BINGO!"&lt;/strong&gt; - correctly derives the rule, confirms the example matches, then immediately says "Wait, let me double-check the math" and re-enters the loop, testing the standard definition again for two more pages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"OH MY GOD, IT MATCHES! IT MATCHES! IT MATCHES!"&lt;/strong&gt; - the most dramatic moment in the trace. Triple confirmation. Then, 50 lines later: &lt;em&gt;"Wait, is there any other way?"&lt;/em&gt; - it re-opens the question it just answered.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"Wait! I just found the logic!"&lt;/strong&gt; - re-derives the same correct answer from scratch, as if the previous two discoveries hadn't happened. Runs the full worked example again to confirm.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It eventually succeeds, through much pain, and includes this rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; # Digit Sum Calculation for Negative Integers:
 #   For a negative integer, the sum of its digits is calculated
 #   by treating the first digit as negative and adding the subsequent
 #   digits.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
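&lt;p&gt;As a rough illustration of what that rule means in code, here is a minimal sketch. The helper name &lt;code&gt;signed_digit_sum&lt;/code&gt; is made up for this example, and it assumes ties keep their original list order, which Python's stable &lt;code&gt;sorted&lt;/code&gt; provides. It is not the code gemma4:26b produced.&lt;/p&gt;

```python
def signed_digit_sum(n):
    """Digit sum where a negative number's leading digit counts as negative."""
    digits = [int(ch) for ch in str(abs(n))]
    if n != abs(n):  # n is negative
        digits[0] = -digits[0]
    return sum(digits)


def order_by_points(nums):
    """Sort ascending by signed digit sum; sorted() is stable, so ties keep input order."""
    return sorted(nums, key=signed_digit_sum)


print(order_by_points([1, 11, -1, -11, -12]))
```

&lt;p&gt;Under this rule -11 scores 0 and -12 scores 1, so the call above prints &lt;code&gt;[-1, -11, 1, -12, 11]&lt;/code&gt;. An &lt;code&gt;abs&lt;/code&gt;-based digit sum cannot produce that ordering.&lt;/p&gt;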



&lt;p&gt;So you have two levers: model capability and spec quality. How much you need either depends on whether you have tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do you need your model to ace this type of task?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  If you have tests
&lt;/h3&gt;

&lt;p&gt;No. You have the freedom to attempt a fast model first and escalate only on failure. While gemma4:26b is not particularly slow at 2.5s median iteration time, other models can get you most of the way there in ~1 second.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ngzuu3olvi2tycspysu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ngzuu3olvi2tycspysu.png" alt="Passrate vs time" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://ai.rhiza.fr/humaneval/results_human-eval-enhanced-202307_cascade_nothink_cascade2.html" rel="noopener noreferrer"&gt;model cascade strategy&lt;/a&gt; similar to this one reaches 100% success faster than any single model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;              &lt;span class="c1"&gt;# one attempt then escalate
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-coder:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# one attempt then escalate
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:26b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;            &lt;span class="c1"&gt;# two final attempts
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
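&lt;p&gt;One way such a list could drive an escalation loop is sketched below. This is an assumption about the general shape, not the harness's actual code: &lt;code&gt;solve_with_cascade&lt;/code&gt;, &lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;passes_tests&lt;/code&gt; are hypothetical names, with the last two standing in for "ask a model for code" and "run the tests".&lt;/p&gt;

```python
CASCADE = [
    ("qwen3:4b", 1),              # one attempt then escalate
    ("qwen2.5-coder:latest", 1),  # one attempt then escalate
    ("gemma4:26b", 2),            # two final attempts
]


def solve_with_cascade(prompt, generate, passes_tests, cascade=CASCADE):
    """Try each (model, attempts) stage in order; return the first passing result.

    generate(model, prompt) and passes_tests(code) are hypothetical callables
    supplied by the caller; they are not part of any real harness API.
    """
    for model, attempts in cascade:
        for _ in range(attempts):
            code = generate(model, prompt)
            if passes_tests(code):
                return model, code
    return None, None  # every stage used up its attempts
```

&lt;p&gt;The tests are what make escalation safe: a cheap model's wrong answer costs one fast attempt instead of shipping silently.&lt;/p&gt;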



&lt;p&gt;It achieves a 100% HumanEval score in 7m 21s, compared to 9m 51s for &lt;a href="https://ai.rhiza.fr/humaneval/results_human-eval-enhanced-202307_gemma4_26b_nothink.html" rel="noopener noreferrer"&gt;gemma4:26b alone&lt;/a&gt; (gemma alone is 34% slower). Until gemma4:26b came along, the case for a cascade was much stronger. If ~2.5 seconds response time is acceptable to you, you don't need to bother orchestrating a cascade.&lt;/p&gt;
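&lt;p&gt;The percentage depends on which run you take as the baseline, so here is the arithmetic on the two timings quoted above (variable names are just for illustration):&lt;/p&gt;

```python
cascade_s = 7 * 60 + 21       # 7m 21s = 441 seconds
gemma_alone_s = 9 * 60 + 51   # 9m 51s = 591 seconds

# gemma alone measured against the cascade
slower_pct = (gemma_alone_s / cascade_s - 1) * 100
# the cascade measured against gemma alone
saved_pct = (1 - cascade_s / gemma_alone_s) * 100

print(f"gemma alone is {slower_pct:.0f}% slower; the cascade saves {saved_pct:.0f}% of the time")
```

&lt;p&gt;So "34% slower" and "about 25% less time" describe the same gap from opposite sides.&lt;/p&gt;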

&lt;h3&gt;
  
  
  If you don't have tests
&lt;/h3&gt;

&lt;p&gt;Then the failure mode is silent. The model writes confident, syntactically correct code that passes casual review but implements the wrong definition. No raised error: no iteration.&lt;/p&gt;

&lt;p&gt;For tasks where you can't test the output - understanding a vague requirement, disambiguating a spec, guessing what you &lt;em&gt;meant&lt;/em&gt; rather than what you &lt;em&gt;said&lt;/em&gt; - you want a model with the instinct to distrust its own priors.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR (Now that you have read the rest)
&lt;/h2&gt;

&lt;p&gt;Don't hope for a perfect model. Instead, combine layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clarity of intent - Well-crafted specs let smaller, faster models succeed. (bad prompt: success needs ~19 GB VRAM -&amp;gt; good prompt: success at 4 GB VRAM)&lt;/li&gt;
&lt;li&gt;Feedback - tests let you fail fast and escalate (a cascade cut solve time by ~25%; put the other way, gemma4:26b alone was 34% slower)&lt;/li&gt;
&lt;li&gt;Deliberate model selection - when the spec is vague, pay more for intuition (too lazy to write a good prompt? no tests? - you'll have to pay for it)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  So, what did gemma4:26b see?
&lt;/h2&gt;

&lt;p&gt;It saw Rule 1 as a hypothesis, not a fact.&lt;/p&gt;

&lt;p&gt;Every model that fails on this problem knows Rule 1 is producing the wrong output. The ones that fail keep searching for a tiebreaker that will salvage it - a secondary sort key, a sign correction, a different index convention. Rule 1's definition of "digit" is never in question.&lt;/p&gt;

&lt;p&gt;26b asked the question the others don't: &lt;strong&gt;what would the definition of digit sum have to be for this output to be correct?&lt;/strong&gt; That's how it arrives at Rule 3 - not by being told, not by pattern-matching a training example, but by treating its own prior as something that could be wrong.&lt;/p&gt;

&lt;p&gt;The instinct is rare. It shows up in the think trace as 782 lines of hypothesis elimination. It shows up in the think=false trace as 153 lines of comments. It shows up in the rewrite task as 14 minutes of self-doubt, fake web searches, and three separate "BINGO" moments before it stops.&lt;/p&gt;

&lt;p&gt;It is not elegant, when visible. But it gets there.&lt;/p&gt;




&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;78 models all fail&lt;/strong&gt; on the &lt;a href="https://all-the-noises.github.io/evalplus/HumanEval/145.html" rel="noopener noreferrer"&gt;EvalPlus leaderboard&lt;/a&gt;: 0 passes, including Claude 3 Haiku, Claude 3 Opus, Claude 3 Sonnet, GPT-3.5, Mixtral 8x7B, Mixtral 8x22B, Meta LLaMA 3 70B, CodeLlama 70B.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4o fails&lt;/strong&gt;: 0/10 runs. One of only 5 problems GPT-4o consistently fails across the entire 164-problem benchmark (&lt;a href="https://riza.io/blog/what-gpt-4o-cant-code" rel="noopener noreferrer"&gt;evaluation&lt;/a&gt;, Aug 2024).&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemma/docs/core" rel="noopener noreferrer"&gt;&lt;strong&gt;gemma4 release&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ai.rhiza.fr/humaneval/" rel="noopener noreferrer"&gt;&lt;strong&gt;Our HumanEval study results&lt;/strong&gt;&lt;/a&gt;: 49 attempts across the Gemma 4 family and others - &lt;strong&gt;2 passes&lt;/strong&gt;, both from gemma4:26b&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rhiza-fr/ollama-codeeval" rel="noopener noreferrer"&gt;&lt;strong&gt;Source for the HumanEval benchmark&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Disclaimer
&lt;/h3&gt;

&lt;p&gt;I was using the quantized Ollama models on Windows, so the gemma4 models were probably running without speculative decoding.&lt;/p&gt;

&lt;p&gt;To be fair to all models, I could have done multiple runs, multiple quantizations, multiple temperatures, even multiple harnesses. My time and compute are limited. This is an honest snapshot of what I saw that happens to have a story about a single badly written toy test that is in gemma's favour. You would be right to remain skeptical about the real implications. Sure.&lt;/p&gt;

&lt;p&gt;Anthropomorphizing is stylistic. Of course gemma didn't 'see' anything. We see artifacts of sampling/activation patterns.&lt;/p&gt;

&lt;p&gt;Without conclusive evidence, I do think it points to good training data, possible MoE advantages and maybe intelligent RLHF/RLAIF practices at Google. I like it.&lt;/p&gt;

&lt;p&gt;BTW: gemma4:26b liked this article (flatter the flatterer), though it wanted me to &lt;strong&gt;add em-dashes for better flow and put the TL;DR at the top&lt;/strong&gt; ;)&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>83k tokens to fix a few tests!? No thanks</title>
      <dc:creator>Chris Kilner</dc:creator>
      <pubDate>Tue, 31 Mar 2026 08:31:47 +0000</pubDate>
      <link>https://forem.com/chris-rhiza-fr/83k-tokens-to-fix-a-few-tests-no-thanks-2kgd</link>
      <guid>https://forem.com/chris-rhiza-fr/83k-tokens-to-fix-a-few-tests-no-thanks-2kgd</guid>
      <description>&lt;p&gt;Claude burned 83,000 tokens fixing test failures after a refactor — raw pytest output, coverage noise, ruff warnings, all re-fed every loop.&lt;/p&gt;

&lt;p&gt;It worked. But it was absurdly expensive.&lt;/p&gt;

&lt;p&gt;The problem isn’t the model — it’s the context.&lt;/p&gt;

&lt;p&gt;So I made &lt;a href="https://github.com/rhiza-fr/py-cq" rel="noopener noreferrer"&gt;&lt;code&gt;cq&lt;/code&gt;&lt;/a&gt; (&lt;a href="https://pypi.org/project/python-code-quality/" rel="noopener noreferrer"&gt;&lt;code&gt;python-code-quality&lt;/code&gt;&lt;/a&gt; on PyPI). It runs 10+ quality tools and surfaces exactly one thing at a time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimal context
&lt;/h2&gt;

&lt;p&gt;Instead of dumping everything into the prompt, &lt;code&gt;cq&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;runs tools in priority order&lt;/li&gt;
&lt;li&gt;stops at the first failure&lt;/li&gt;
&lt;li&gt;emits a single, focused fix request
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; cq check &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; llm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="sb"&gt;`&lt;/span&gt;src/myproject/utils.py:21&lt;span class="sb"&gt;`&lt;/span&gt; — &lt;span class="k"&gt;**&lt;/span&gt;F841&lt;span class="k"&gt;**&lt;/span&gt;: Local variable &lt;span class="sb"&gt;`&lt;/span&gt;unused_variable&lt;span class="sb"&gt;`&lt;/span&gt; is assigned to but never used

18:     min_dist &lt;span class="o"&gt;=&lt;/span&gt; float&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"inf"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
19:     nearest_city &lt;span class="o"&gt;=&lt;/span&gt; None
20:     &lt;span class="k"&gt;for &lt;/span&gt;city &lt;span class="k"&gt;in &lt;/span&gt;cities:
21:         unused_variable &lt;span class="o"&gt;=&lt;/span&gt; 67
22:         dist &lt;span class="o"&gt;=&lt;/span&gt; calc_dist&lt;span class="o"&gt;(&lt;/span&gt;current_city, city&lt;span class="o"&gt;)&lt;/span&gt;

Please fix only this issue. After fixing, run &lt;span class="sb"&gt;`&lt;/span&gt;cq check &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; llm&lt;span class="sb"&gt;`&lt;/span&gt; to verify.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. No test logs, no coverage spam, no unrelated warnings.&lt;/p&gt;

&lt;p&gt;If the error looks like a caller / callee mismatch, we fetch the callee signature to potentially avoid an extra tool-call.&lt;/p&gt;

&lt;h2&gt;
  
  
  The minimal loop
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;smallest complete context → smallest capable model → fewest tool calls → successful edit&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Small, focused context means you can use a small, cheap model and get the fix in 1 second. No tool-calling needed (if you edit yourself):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cq check &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; llm | ollama run qwen3:4b &lt;span class="nt"&gt;--think&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'show a unified diff to correct this code. Add a one line explanation'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;--- a/src/myapp/calculator.py
&lt;/span&gt;&lt;span class="gi"&gt;+++ b/src/myapp/calculator.py
&lt;/span&gt;&lt;span class="p"&gt;@@ -1,5 +1,5 @@&lt;/span&gt;
 def evaluate(expression):
&lt;span class="gd"&gt;-    return eval(expression)
&lt;/span&gt;&lt;span class="gi"&gt;+    import ast
+    return ast.literal_eval(expression)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt; Replaced &lt;code&gt;eval()&lt;/code&gt; with &lt;code&gt;ast.literal_eval()&lt;/code&gt; to safely evaluate strings as Python literals.&lt;/p&gt;

&lt;p&gt;Apply the fix. Run &lt;code&gt;cq&lt;/code&gt; again. Repeat.&lt;/p&gt;

&lt;p&gt;Or with Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cq check &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; llm | claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"fix this"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tool ordering
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;-o llm&lt;/code&gt; mode, the tools are run sequentially, and we stop at the first error.&lt;/p&gt;

&lt;p&gt;In other modes, we run in parallel and cache results for fast re-runs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cq check &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ Tool             ┃     Time ┃                    Metric ┃ Score   ┃ Status   ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ compile          │    0.47s │                   compile │ 1.000   │ OK       │
│ ruff             │    0.22s │                      lint │ 1.000   │ OK       │
│ ty               │    0.80s │                type_check │ 1.000   │ OK       │
│ bandit           │    0.53s │                  security │ 1.000   │ OK       │
│ pytest           │    2.11s │                     tests │ 1.000   │ OK       │
│ radon-cc         │    0.34s │                simplicity │ 0.982   │ OK       │
│ radon-mi         │    0.41s │           maintainability │ 0.848   │ OK       │
│ radon-hal        │    0.36s │             file_bug_free │ 0.810   │ OK       │
│ radon-hal        │          │            file_smallness │ 0.655   │ OK       │
│ radon-hal        │          │        functions_bug_free │ 0.808   │ OK       │
│ radon-hal        │          │       functions_smallness │ 0.808   │ OK       │
│ vulture          │    0.37s │                 dead_code │ 1.000   │ OK       │
│ interrogate      │    0.38s │              doc_coverage │ 0.853   │ OK       │
│                  │          │                     Score │ 0.945   │          │
└──────────────────┴──────────┴───────────────────────────┴─────────┴──────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Claude Code stop hook
&lt;/h2&gt;

&lt;p&gt;If you want to auto-run, add a hook to your project's &lt;code&gt;.claude/settings.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Stop"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cq check . -o score &amp;amp;&amp;amp; echo 'CQ: all clear' || cq check . -o llm; true"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;pass → tiny output&lt;/li&gt;
&lt;li&gt;fail → targeted fix prompt&lt;/li&gt;
&lt;li&gt;loop continues with minimal context&lt;/li&gt;
&lt;/ul&gt;
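&lt;p&gt;If the one-line shell command is hard to read, the same control flow can be spelled out as a small Python sketch. It assumes, as the hook itself does, that &lt;code&gt;cq check . -o score&lt;/code&gt; exits non-zero when a check fails. The &lt;code&gt;stop_hook&lt;/code&gt; function and its &lt;code&gt;run&lt;/code&gt; parameter are illustrative only, not part of &lt;code&gt;cq&lt;/code&gt;.&lt;/p&gt;

```python
import subprocess


def stop_hook(run=subprocess.run):
    """Mirror the hook's shell logic: all-clear on success, fix prompt on failure."""
    # Assumption: the score mode exits non-zero when any check fails.
    score = run(["cq", "check", ".", "-o", "score"])
    if score.returncode == 0:
        print("CQ: all clear")                  # pass: tiny output
    else:
        run(["cq", "check", ".", "-o", "llm"])  # fail: targeted fix prompt
    return 0                                    # like the trailing `true`: always exit 0
```

&lt;p&gt;Always returning 0 matches the trailing &lt;code&gt;true&lt;/code&gt; in the command: a failing check changes what gets printed, not the hook's exit status.&lt;/p&gt;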

&lt;p&gt;For manual use, create &lt;code&gt;.claude/commands/cq-fix.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="si"&gt;$(&lt;/span&gt;cq check &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; llm&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;/cq-fix&lt;/code&gt; embeds the live output directly into the prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv tool &lt;span class="nb"&gt;install &lt;/span&gt;python-code-quality
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Help
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cq check &lt;span class="nt"&gt;--help&lt;/span&gt;

 Usage: cq check &lt;span class="o"&gt;[&lt;/span&gt;OPTIONS] &lt;span class="o"&gt;[&lt;/span&gt;PATH]                                                                                                                                                                                                                                                                                                                                            

 Feed the results from 11+ code quality tools to an LLM. Try: cq check &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; llm

╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│   path      &lt;span class="o"&gt;[&lt;/span&gt;PATH]  Path to Python file or project directory &lt;span class="o"&gt;[&lt;/span&gt;default: .]                                                           │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ &lt;span class="nt"&gt;--output&lt;/span&gt;       &lt;span class="nt"&gt;-o&lt;/span&gt;      &lt;span class="o"&gt;[&lt;/span&gt;table|score|json|llm|raw]  Output mode: table &lt;span class="o"&gt;(&lt;/span&gt;default&lt;span class="o"&gt;)&lt;/span&gt;, score, json, llm                                   │
│ &lt;span class="nt"&gt;--log-level&lt;/span&gt;            TEXT                        Logging level &lt;span class="o"&gt;(&lt;/span&gt;DEBUG, INFO, WARNING, ERROR, CRITICAL&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;default: CRITICAL]        │
│ &lt;span class="nt"&gt;--clear-cache&lt;/span&gt;                                      Clear cached tool results before running                                         │
│ &lt;span class="nt"&gt;--workers&lt;/span&gt;              INTEGER                     Max parallel workers &lt;span class="o"&gt;(&lt;/span&gt;default: one per tool, use 1 &lt;span class="k"&gt;for &lt;/span&gt;sequential&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;default: 0]  │
│ &lt;span class="nt"&gt;--language&lt;/span&gt;     &lt;span class="nt"&gt;-l&lt;/span&gt;      TEXT                        Override language detection &lt;span class="o"&gt;(&lt;/span&gt;e.g. python, typescript, rust&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# FUTURE             │&lt;/span&gt;
│ &lt;span class="nt"&gt;--only&lt;/span&gt;                 TEXT                        Comma-separated tool IDs to run &lt;span class="o"&gt;(&lt;/span&gt;e.g. ruff,ty,pytest&lt;span class="o"&gt;)&lt;/span&gt;                            │
│ &lt;span class="nt"&gt;--skip&lt;/span&gt;                 TEXT                        Comma-separated tool IDs to skip &lt;span class="o"&gt;(&lt;/span&gt;e.g. bandit,vulture&lt;span class="o"&gt;)&lt;/span&gt;                           │
│ &lt;span class="nt"&gt;--exclude&lt;/span&gt;              TEXT                        Comma-separated paths to exclude &lt;span class="o"&gt;(&lt;/span&gt;e.g. demo,docs&lt;span class="o"&gt;)&lt;/span&gt;                                │
│ &lt;span class="nt"&gt;--help&lt;/span&gt;                                             Show this message and exit.                                                      │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python only (for now), but the approach generalizes&lt;/li&gt;
&lt;li&gt;No agent/tool orchestration required — just a shell pipeline&lt;/li&gt;
&lt;li&gt;Works with local models or hosted ones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/rhiza-fr/py-cq" rel="noopener noreferrer"&gt;github.com/rhiza-fr/py-cq&lt;/a&gt; — MIT, actively maintained.&lt;/p&gt;

&lt;p&gt;Enjoy!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>cli</category>
    </item>
  </channel>
</rss>
