<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nathan Maine</title>
    <description>The latest articles on Forem by Nathan Maine (@dentity007).</description>
    <link>https://forem.com/dentity007</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3858306%2Fe600f034-eccc-4bb7-a775-189eb2753fe8.png</url>
      <title>Forem: Nathan Maine</title>
      <link>https://forem.com/dentity007</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dentity007"/>
    <language>en</language>
    <item>
      <title>Gemma 4 After 24 Hours: What the Community Found vs What Google Promised</title>
      <dc:creator>Nathan Maine</dc:creator>
      <pubDate>Fri, 03 Apr 2026 02:31:45 +0000</pubDate>
      <link>https://forem.com/dentity007/-gemma-4-after-24-hours-what-the-community-found-vs-what-google-promised-3a2f</link>
      <guid>https://forem.com/dentity007/-gemma-4-after-24-hours-what-the-community-found-vs-what-google-promised-3a2f</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1o740cx5wp5k082njqhi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1o740cx5wp5k082njqhi.jpeg" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google released Gemma 4 yesterday under Apache 2.0. The benchmarks looked incredible. The community went to work. Here's what we're actually seeing.&lt;/p&gt;

&lt;p&gt;I spent the last 24 hours reading through forums, running my own fine-tuning experiments, and collecting reports from dozens of early adopters. This is a summary of the real-world findings, the open questions, and where I think this model family lands.&lt;/p&gt;




&lt;h2&gt;The Good News First&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Apache 2.0 is a big deal.&lt;/strong&gt; Previous Gemma releases used a custom Google license that technically allowed them to restrict usage. Apache 2.0 removes that uncertainty entirely. For anyone building commercial products on open models, this matters more than any benchmark number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multilingual quality is genuinely strong.&lt;/strong&gt; Users testing German, Arabic, Vietnamese, and French are reporting that Gemma 4 outperforms Qwen 3.5 in non-English tasks. One user called it "in a tier of its own" for translation. Another said it "makes translategemma feel outdated instantly." For global enterprise deployments, this is a significant differentiator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ELO score tells a different story than benchmarks.&lt;/strong&gt; The 31B model scored ~1452 on LMArena, which puts it above GPT-OSS-120B and comparable to GPT-5-mini. But side-by-side benchmark tables show it roughly tying with Qwen 3.5 27B. The gap between ELO (human preference) and automated benchmarks suggests Gemma 4 produces responses that humans prefer even when raw accuracy is similar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The E2B model is absurd.&lt;/strong&gt; Multiple users confirmed that the 2.3B effective parameter model beats Gemma 3 27B on most benchmarks. A user running it on a basic i7 laptop with 32GB RAM reported it was "not only faster, it gives significantly better answers" than Qwen 3.5 4B for finance analysis.&lt;/p&gt;




&lt;h2&gt;The Problems Nobody Warned About&lt;/h2&gt;

&lt;h3&gt;Inference Speed&lt;/h3&gt;

&lt;p&gt;This is the elephant in the room. Multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs significantly slower than Qwen 3.5's equivalent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One user: &lt;strong&gt;11 tokens/sec on Gemma 4 26B-A4B vs 60+ tokens/sec on Qwen 3.5 35B-A3B&lt;/strong&gt; on the same 5060 Ti 16GB&lt;/li&gt;
&lt;li&gt;Another confirmed higher VRAM usage for context at the same quantization level&lt;/li&gt;
&lt;li&gt;Someone running on a DGX Spark asked "why is it super slow?" with no clear answer yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the dense 31B model, users are reporting 18-25 tokens/sec on dual NVIDIA GPUs (5070 Ti + 5060 Ti), which is reasonable but not fast.&lt;/p&gt;

&lt;p&gt;The speed gap against Qwen 3.5 is concerning for production deployments where latency matters.&lt;/p&gt;
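
&lt;p&gt;If you want to sanity-check these reports on your own hardware, a rough decode-throughput measurement takes only a few lines. A minimal sketch, assuming a CUDA GPU and a placeholder model ID (verify the real repo name on Hugging Face); a serious benchmark would average several prompts and time prefill separately:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough tokens/sec check -- a sketch, not a rigorous benchmark
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-26b-a4b-it"  # placeholder; verify the real repo name
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("Explain KV caching in two sentences.", return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=8)  # warmup pass

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;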

&lt;h3&gt;VRAM Consumption&lt;/h3&gt;

&lt;p&gt;Gemma models have historically been VRAM-hungry for context, and Gemma 4 appears to continue this pattern. One user noted they could only fit Gemma 3 27B Q4 with 20K context on a 5090, while Qwen 3.5 27B Q4 fit with 190K context on the same card.&lt;/p&gt;

&lt;p&gt;For the 256K context window to be useful in practice, you need significantly more VRAM than competing models at the same parameter count.&lt;/p&gt;
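
&lt;p&gt;The arithmetic is easy to sketch: KV-cache memory grows linearly with context, roughly 2 (K and V) x layers x KV heads x head dim x bytes per element x tokens. The config numbers below are illustrative placeholders, not Gemma 4's published architecture; swap in the values from the model's &lt;code&gt;config.json&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-envelope KV-cache size (config values are illustrative, not Gemma 4's)
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # 2 = one K tensor and one V tensor per layer
    total = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total / 1024**3

# Hypothetical 27B-class dense model, bf16 cache (2 bytes/elem):
print(kv_cache_gib(62, 16, 128, 20_000))   # ~9.5 GiB
print(kv_cache_gib(62, 16, 128, 190_000))  # ~90 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Models that use fewer KV heads (more aggressive grouped-query attention) or narrower heads shrink that figure linearly, which would explain gaps of this size between architectures at the same parameter count.&lt;/p&gt;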

&lt;h3&gt;Fine-Tuning Compatibility&lt;/h3&gt;

&lt;p&gt;As someone who attempted QLoRA fine-tuning within hours of release, I can confirm the tooling is not ready. Three issues hit immediately:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace Transformers&lt;/strong&gt; didn't recognize the &lt;code&gt;gemma4&lt;/code&gt; architecture (required installing from source; command below)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PEFT&lt;/strong&gt; couldn't handle &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt;, a new layer type in the vision encoder (required a monkey-patch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A new &lt;code&gt;mm_token_type_ids&lt;/code&gt; field&lt;/strong&gt; is required during training even for text-only data (required a custom data collator)&lt;/li&gt;
&lt;/ol&gt;
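
&lt;p&gt;The first of those is the quickest to clear. Installing from source pulls the dev branch that carries the new architecture, so treat it as a stopgap until the next stable release:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip install git+https://github.com/huggingface/transformers.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;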

&lt;p&gt;I've filed issues on both huggingface/peft and huggingface/transformers. Both received responses within hours, and a fix for the &lt;code&gt;mm_token_type_ids&lt;/code&gt; issue is already in progress. Unsloth also has day-one support if you prefer that path.&lt;/p&gt;

&lt;p&gt;The community question "how easy is it to fine-tune compared to Gemma 3?" currently has no good answer beyond "harder, but solvable."&lt;/p&gt;

&lt;h3&gt;Stability Questions&lt;/h3&gt;

&lt;p&gt;One user testing the non-quantized 31B in Google AI Studio reported "infinite loops and no possibility to read text from the image." Another found that the model jailbreaks with basic system prompts. A third reported Mac hard crashes when loading either the 31B or 26B in LM Studio.&lt;/p&gt;

&lt;p&gt;These are early reports and may be resolved with updates, but they're worth noting for anyone considering production deployment.&lt;/p&gt;




&lt;h2&gt;The Benchmark Reality&lt;/h2&gt;

&lt;p&gt;The community quickly assembled side-by-side comparisons. Here's the consolidated picture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Gemma 4 31B&lt;/th&gt;
&lt;th&gt;Qwen 3.5 27B&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MMLU-Pro&lt;/td&gt;
&lt;td&gt;85.2%&lt;/td&gt;
&lt;td&gt;86.1%&lt;/td&gt;
&lt;td&gt;Qwen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA Diamond&lt;/td&gt;
&lt;td&gt;84.3%&lt;/td&gt;
&lt;td&gt;85.5%&lt;/td&gt;
&lt;td&gt;Qwen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiveCodeBench v6&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;80.7%&lt;/td&gt;
&lt;td&gt;Tie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codeforces ELO&lt;/td&gt;
&lt;td&gt;2150&lt;/td&gt;
&lt;td&gt;1899&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gemma&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TAU2-Bench&lt;/td&gt;
&lt;td&gt;76.9%&lt;/td&gt;
&lt;td&gt;79.0%&lt;/td&gt;
&lt;td&gt;Qwen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMMLU&lt;/td&gt;
&lt;td&gt;88.4%&lt;/td&gt;
&lt;td&gt;85.9%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gemma&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HLE (no tools)&lt;/td&gt;
&lt;td&gt;19.5%&lt;/td&gt;
&lt;td&gt;24.3%&lt;/td&gt;
&lt;td&gt;Qwen&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemma 4 wins on competitive coding (ELO) and multilingual (MMMLU). Qwen 3.5 wins on most reasoning benchmarks. Neither is a clear overall winner.&lt;/p&gt;

&lt;p&gt;The honest take from one top commenter: "Gemma 4 ties with Qwen, if not Qwen being slightly ahead. And Qwen 3.5 is more compute efficient too."&lt;/p&gt;




&lt;h2&gt;What the Community Is Waiting For&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;QAT versions.&lt;/strong&gt; Gemma 3 QAT (quantization-aware training) models arrived weeks after the initial release. The community expects the same for Gemma 4, and these will likely improve quantized inference quality significantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abliterated/uncensored versions.&lt;/strong&gt; At least one already exists. Multiple users are requesting more. The Apache 2.0 license makes this fully legal now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Larger models.&lt;/strong&gt; There were rumors of a 120B model that didn't materialize. Several users expressed disappointment. A 100B+ MoE from Google could be transformative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A 9-12B dense model.&lt;/strong&gt; The gap between E4B (4.5B effective) and 26B MoE leaves a hole in the lineup. Gemma 3's 12B model was popular, and there's no direct upgrade path.&lt;/p&gt;




&lt;h2&gt;Where This Leaves Us&lt;/h2&gt;

&lt;p&gt;Gemma 4 is not the clear winner the benchmarks suggested. But it's not trying to be.&lt;/p&gt;

&lt;p&gt;The real value proposition is the combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache 2.0&lt;/strong&gt; (fully permissive, no restrictions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual excellence&lt;/strong&gt; (best in class for non-English)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Base models available&lt;/strong&gt; (fine-tuning ready on day one)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size diversity&lt;/strong&gt; (2B to 31B covers edge to server)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native system prompts and function calling&lt;/strong&gt; (production-ready features; see the sketch below)&lt;/li&gt;
&lt;/ul&gt;
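
&lt;p&gt;That last point is easy to check yourself. A minimal sketch, assuming the released checkpoints ship a chat template that accepts a &lt;code&gt;system&lt;/code&gt; role (the model ID is a placeholder); some earlier Gemma templates rejected system turns outright:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Inspect how the chat template renders a system turn (model ID is a placeholder)
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-31b-it")
messages = [
    {"role": "system", "content": "You are a terse assistant. Answer in one sentence."},
    {"role": "user", "content": "What does Apache 2.0 permit?"},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # older Gemma templates raised on the system role; Gemma 4 should not
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;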

&lt;p&gt;For English-only, benchmark-optimized, speed-critical deployments, Qwen 3.5 is still the better choice. For multilingual, legally unrestricted, fine-tuning-focused use cases, Gemma 4 has a compelling argument.&lt;/p&gt;

&lt;p&gt;The speed and VRAM issues need to be addressed. The fine-tuning tooling needs a week or two to catch up. And we need QAT quantizations before the smaller models can truly compete on efficiency.&lt;/p&gt;

&lt;p&gt;But make no mistake: releasing a 31B dense model under Apache 2.0 that rivals models 4-10x its size on human preference benchmarks is a significant moment for open AI. Google is finally competing on openness, not just capability.&lt;/p&gt;

&lt;p&gt;I'll be publishing my fine-tuning results (including the day-zero bug fixes) and benchmark comparisons as the training run completes. Follow along if you're interested.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nathan Maine builds AI systems for regulated industries. He is currently fine-tuning Gemma 4 31B for domain-specific deployment and has filed bug reports on huggingface/peft and huggingface/transformers for day-zero compatibility issues.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>google</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Fine-Tuning Gemma 4 on Day Zero: 3 Bugs We Solved in 30 Minutes</title>
      <dc:creator>Nathan Maine</dc:creator>
      <pubDate>Thu, 02 Apr 2026 20:35:03 +0000</pubDate>
      <link>https://forem.com/dentity007/fine-tuning-gemma-4-on-day-zero-3-bugs-we-solved-in-30-minutes-2ke</link>
      <guid>https://forem.com/dentity007/fine-tuning-gemma-4-on-day-zero-3-bugs-we-solved-in-30-minutes-2ke</guid>
      <description>&lt;p&gt;Google released &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" rel="noopener noreferrer"&gt;Gemma 4&lt;/a&gt; today under Apache 2.0 — their most capable open model family. The 31B dense model scores ~1452 on LMArena with a 256K context window.&lt;/p&gt;

&lt;p&gt;We wanted to fine-tune it immediately: QLoRA on a single NVIDIA B200. The pipeline broke three times before training started.&lt;/p&gt;

&lt;p&gt;Here's what happened and how we fixed each one.&lt;/p&gt;




&lt;h2&gt;Bug 1: "Transformers does not recognize this architecture"&lt;/h2&gt;

&lt;p&gt;The first error hits before the model even loads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValueError: The checkpoint you are trying to load has model type `gemma4` 
but Transformers does not recognize this architecture.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; The latest stable Transformers release (5.4.0) shipped before Gemma 4 existed. The &lt;code&gt;gemma4&lt;/code&gt; model type only exists in the dev branch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Install from source.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;git+https://github.com/huggingface/transformers.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gets you 5.5.0.dev0, which includes the &lt;code&gt;Gemma4ForConditionalGeneration&lt;/code&gt; class.&lt;/p&gt;
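
&lt;p&gt;A quick sanity check that the dev build took, before pulling down the full weights (the model ID is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Confirm the dev install resolves the new architecture (placeholder model ID)
import transformers
print(transformers.__version__)  # expect a .dev0 version string

from transformers import AutoConfig
cfg = AutoConfig.from_pretrained("google/gemma-4-31b-it")  # fetches config.json only
print(type(cfg).__name__)  # resolves only if the gemma4 model type is registered
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;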

&lt;p&gt;&lt;strong&gt;Time to fix:&lt;/strong&gt; 2 minutes.&lt;/p&gt;




&lt;h2&gt;Bug 2: "Target module Gemma4ClippableLinear is not supported"&lt;/h2&gt;

&lt;p&gt;After installing Transformers from source, the model loads fine. But when PEFT tries to apply LoRA:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValueError: Target module Gemma4ClippableLinear(
  (linear): Linear4bit(in_features=1152, out_features=1152, bias=False)
) is not supported.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Gemma 4 introduces a new layer type called &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt; for its vision and audio encoders. It wraps &lt;code&gt;nn.Linear&lt;/code&gt; with optional input/output clamping for numerical stability. The catch: it inherits from &lt;code&gt;nn.Module&lt;/code&gt;, not &lt;code&gt;nn.Linear&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;PEFT checks the type of every target module before applying LoRA. Since &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt; isn't &lt;code&gt;nn.Linear&lt;/code&gt;, PEFT rejects it — even though we only want to apply LoRA to the text decoder layers, not the vision encoder.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;exclude_modules&lt;/code&gt; parameter doesn't help either. PEFT runs the type check &lt;em&gt;before&lt;/em&gt; filtering, so excluded modules still need to be recognized types.&lt;/p&gt;

&lt;p&gt;Installing PEFT from source doesn't change this: the support simply doesn't exist yet.&lt;/p&gt;
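
&lt;p&gt;A stripped-down illustration of why the check fails (a stand-in class, not the real Gemma 4 implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Simplified stand-in showing the PEFT type-check problem
import torch.nn as nn

class ClippableLinear(nn.Module):  # wraps a Linear instead of subclassing it
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x):
        return self.linear(x)

layer = ClippableLinear(1152, 1152)
print(isinstance(layer, nn.Linear))  # False, so PEFT refuses to wrap it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;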

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Monkey-patch &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt; to inherit from &lt;code&gt;nn.Linear&lt;/code&gt; before loading the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers.models.gemma4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;modeling_gemma4&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PatchedClippableLinear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_clipped_linears&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_clipped_linears&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_clipped_linears&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_buffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_min&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_buffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_buffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_min&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_buffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_clipped_linears&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_clipped_linears&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;

&lt;span class="n"&gt;modeling_gemma4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Gemma4ClippableLinear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PatchedClippableLinear&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Place this &lt;strong&gt;before&lt;/strong&gt; any &lt;code&gt;AutoModelForCausalLM.from_pretrained()&lt;/code&gt; call. PEFT now sees the vision encoder layers as standard linear layers and proceeds normally.&lt;/p&gt;
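
&lt;p&gt;For concreteness, the load order looks like this. A sketch, not our full training script: the model ID and LoRA target module names are illustrative, so verify them against the actual checkpoint:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Patch first, then load quantized, then wrap with PEFT (illustrative sketch)
# Assumes the PatchedClippableLinear monkey-patch above has already run.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31b-it",  # placeholder model ID
    quantization_config=bnb,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical names; verify
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;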

&lt;p&gt;Result: 534M trainable parameters (1.68% of 31.8B total).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time to fix:&lt;/strong&gt; 15 minutes (including reading the Gemma 4 source to understand the layer).&lt;/p&gt;




&lt;h2&gt;Bug 3: "mm_token_type_ids is required"&lt;/h2&gt;

&lt;p&gt;LoRA applies, data loads, training starts — and immediately crashes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValueError: `mm_token_type_ids` is required as a model input when training
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Gemma 3 required &lt;code&gt;token_type_ids&lt;/code&gt; during training. Gemma 4 adds a second required field: &lt;code&gt;mm_token_type_ids&lt;/code&gt; (multimodal token type IDs). The model validates their presence in the forward pass, even for text-only training. For text-only inputs, both should be all zeros.&lt;/p&gt;

&lt;p&gt;Standard tokenizers and data collators don't produce &lt;code&gt;mm_token_type_ids&lt;/code&gt;. You need a custom collator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Add both fields during tokenization and build a custom data collator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# During tokenization
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokenized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mm_token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokenized&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Custom data collator
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GemmaCollator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;max_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;pad_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pad_token_id&lt;/span&gt;
        &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mm_token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;pad_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_len&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pad_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pad_len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pad_len&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;max_len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mm_token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;max_len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labels&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pad_len&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important: set &lt;code&gt;remove_unused_columns=False&lt;/code&gt; in your training config, or the trainer will strip &lt;code&gt;mm_token_type_ids&lt;/code&gt; before it reaches the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;training_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SFTConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;...,&lt;/span&gt;
    &lt;span class="n"&gt;dataset_text_field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;remove_unused_columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SFTTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenized_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data_collator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;GemmaCollator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Time to fix:&lt;/strong&gt; 5 minutes.&lt;/p&gt;




&lt;h2&gt;The Result&lt;/h2&gt;

&lt;p&gt;After all three fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;31B model training at 4.5s/step&lt;/strong&gt; on a single NVIDIA B200 (192GB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;534M trainable parameters&lt;/strong&gt; via QLoRA (1.68% of 31.8B)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU utilization: 89%&lt;/strong&gt;, 38GB VRAM used&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Estimated training time: ~7.5 hours&lt;/strong&gt; for 3 epochs on 16K examples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total time from "model released" to "training steps running": &lt;strong&gt;under 4 hours&lt;/strong&gt; (including model download).&lt;/p&gt;




&lt;h2&gt;Key Takeaways&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Day-zero fine-tuning requires bleeding-edge dependencies.&lt;/strong&gt; Install Transformers and PEFT from source when working with newly released models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multimodal models have hidden requirements for text-only training.&lt;/strong&gt; Both &lt;code&gt;token_type_ids&lt;/code&gt; and &lt;code&gt;mm_token_type_ids&lt;/code&gt; are validated even when no images or audio are involved.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PEFT's type checking happens before module filtering.&lt;/strong&gt; Even if you exclude vision modules, they still need to be recognized types. Monkey-patching is a valid workaround until official support lands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;None of these could have been avoided with more experience.&lt;/strong&gt; They're day-zero discovery problems. The difference is how fast you solve them.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Issues filed:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;[huggingface/peft] Gemma4ClippableLinear not supported&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;[huggingface/transformers] mm_token_type_ids required for text-only fine-tuning&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Both include workarounds and suggested fixes. PRs welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
